/* Coding system handler (conversion, detection, etc.).
   Copyright (C) 1995, 1997, 1998, 2002 Electrotechnical Laboratory, JAPAN.
   Licensed to the Free Software Foundation.
   Copyright (C) 2001, 2002 Free Software Foundation, Inc.

This file is part of GNU Emacs.

GNU Emacs is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2, or (at your option)
any later version.

GNU Emacs is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with GNU Emacs; see the file COPYING.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.  */
/*** TABLE OF CONTENTS ***

  0. General comments
  1. Preamble
  2. Emacs' internal format (emacs-mule) handlers
  3. ISO2022 handlers
  4. Shift-JIS and BIG5 handlers
  5. CCL handlers
  6. End-of-line handlers
  7. C library functions
  8. Emacs Lisp library functions
  9. Post-amble

*/
38 /*** 0. General comments ***/
41 /*** GENERAL NOTE on CODING SYSTEMS ***
43 A coding system is an encoding mechanism for one or more character
44 sets. Here's a list of coding systems which Emacs can handle. When
45 we say "decode", it means converting some other coding system to
46 Emacs' internal format (emacs-mule), and when we say "encode",
47 it means converting the coding system emacs-mule to some other
48 coding system.
50 0. Emacs' internal format (emacs-mule)
52 Emacs itself holds a multi-lingual character in buffers and strings
53 in a special format. Details are described in section 2.
55 1. ISO2022
57 The most famous coding system for multiple character sets. X's
58 Compound Text, various EUCs (Extended Unix Code), and coding
59 systems used in Internet communication such as ISO-2022-JP are
60 all variants of ISO2022. Details are described in section 3.
62 2. SJIS (or Shift-JIS or MS-Kanji-Code)
64 A coding system to encode character sets: ASCII, JISX0201, and
65 JISX0208. Widely used for PC's in Japan. Details are described in
66 section 4.
68 3. BIG5
70 A coding system to encode the character sets ASCII and Big5. Widely
71 used for Chinese (mainly in Taiwan and Hong Kong). Details are
72 described in section 4. In this file, when we write "BIG5"
73 (all uppercase), we mean the coding system, and when we write
74 "Big5" (capitalized), we mean the character set.
76 4. Raw text
78 A coding system for text containing random 8-bit code. Emacs does
79 no code conversion on such text except for end-of-line format.
81 5. Other
83 If a user wants to read/write text encoded in a coding system not
84 listed above, he can supply a decoder and an encoder for it as CCL
85 (Code Conversion Language) programs. Emacs executes the CCL program
86 while reading/writing.
88 Emacs represents a coding system by a Lisp symbol that has a property
89 `coding-system'. But, before actually using the coding system, the
90 information about it is set in a structure of type `struct
91 coding_system' for rapid processing. See section 6 for more details.
/*** GENERAL NOTES on END-OF-LINE FORMAT ***

  How the end of a line of text is encoded depends on the operating
  system.  For instance, Unix's format is just one byte of `line-feed'
  code, whereas DOS's format is a two-byte sequence of
  `carriage-return' and `line-feed' codes.  MacOS's format is usually
  one byte of `carriage-return'.

  Since text character encoding and end-of-line encoding are
  independent, any coding system described above can have any
  end-of-line format.  So Emacs keeps the end-of-line format as part
  of each coding system.  See section 6 for more details.

*/
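
/* Illustration only, not part of coding.c.  A minimal sketch of how
   the three end-of-line formats described above can be normalized to
   a single line-feed while decoding.  The enum `eol_format' and the
   function `eol_to_lf' are hypothetical names chosen for this sketch;
   the real handlers use CODING_EOL_LF, CODING_EOL_CRLF, and
   CODING_EOL_CR and also track partial input, which is omitted
   here.  */

```c
#include <stddef.h>

/* Hypothetical stand-ins for the CODING_EOL_XXX values.  */
enum eol_format { EOL_LF, EOL_CRLF, EOL_CR };

/* Copy LEN bytes at SRC to DST, converting end-of-line format FMT to
   a single line-feed.  DST must have room for LEN bytes.  Return the
   number of bytes written.  */
size_t
eol_to_lf (const unsigned char *src, size_t len, enum eol_format fmt,
           unsigned char *dst)
{
  size_t i, n = 0;

  for (i = 0; i < len; i++)
    {
      unsigned char c = src[i];

      if (c == '\r' && fmt == EOL_CR)
        c = '\n';
      else if (c == '\r' && fmt == EOL_CRLF
               && i + 1 < len && src[i + 1] == '\n')
        continue;               /* Drop CR; the LF is copied next.  */
      dst[n++] = c;
    }
  return n;
}
```

/* In this sketch a lone carriage-return in CRLF text is copied
   unchanged, which matches the way decode_coding_emacs_mule keeps a
   `\r' that is not followed by `\n'.  */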
110 /*** GENERAL NOTES on `detect_coding_XXX ()' functions ***
112 These functions check if a text between SRC and SRC_END is encoded
113 in the coding system category XXX. Each returns an integer value in
114 which appropriate flag bits for the category XXX are set. The flag
115 bits are defined in macros CODING_CATEGORY_MASK_XXX. Below is the
116 template for these functions. If MULTIBYTEP is nonzero, 8-bit codes
117 of the range 0x80..0x9F are in multibyte form. */
118 #if 0
120 detect_coding_emacs_mule (src, src_end, multibytep)
121 unsigned char *src, *src_end;
122 int multibytep;
126 #endif
128 /*** GENERAL NOTES on `decode_coding_XXX ()' functions ***
130 These functions decode SRC_BYTES length of unibyte text at SOURCE
131 encoded in CODING to Emacs' internal format. The resulting
132 multibyte text goes to a place pointed to by DESTINATION, the length
133 of which should not exceed DST_BYTES.
135 These functions set the information about original and decoded texts
136 in the members `produced', `produced_char', `consumed', and
137 `consumed_char' of the structure *CODING. They also set the member
138 `result' to one of CODING_FINISH_XXX indicating how the decoding
139 finished.
141 DST_BYTES zero means that the source area and destination area are
142 overlapped, which means that we can produce a decoded text until it
143 reaches the head of the not-yet-decoded source text.
145 Below is a template for these functions. */
146 #if 0
147 static void
148 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes)
149 struct coding_system *coding;
150 unsigned char *source, *destination;
151 int src_bytes, dst_bytes;
155 #endif
/*** GENERAL NOTES on `encode_coding_XXX ()' functions ***

  These functions encode SRC_BYTES length text at SOURCE from Emacs'
  internal multibyte format to CODING.  The resulting unibyte text
  goes to a place pointed to by DESTINATION, the length of which
  should not exceed DST_BYTES.

  These functions set the information about original and encoded texts
  in the members `produced', `produced_char', `consumed', and
  `consumed_char' of the structure *CODING.  They also set the member
  `result' to one of CODING_FINISH_XXX indicating how the encoding
  finished.

  DST_BYTES zero means that the source area and destination area are
  overlapped, which means that we can produce encoded text until it
  reaches the head of the not-yet-encoded source text.

  Below is a template for these functions.  */
#if 0
static void
encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes)
     struct coding_system *coding;
     unsigned char *source, *destination;
     int src_bytes, dst_bytes;
{
  ...
}
#endif
186 /*** COMMONLY USED MACROS ***/
/* The following two macros ONE_MORE_BYTE and TWO_MORE_BYTES safely
   get one and two bytes from the source text respectively.  If there
   are not enough bytes in the source, they set coding->result to
   CODING_FINISH_INSUFFICIENT_SRC and jump to `label_end_of_loop'.
   The caller should set variables `coding', `src' and `src_end' to
   appropriate pointers in advance.  These macros are called from
   decoding routines `decode_coding_XXX', thus it is assumed that the
   source text is unibyte.  */

#define ONE_MORE_BYTE(c1)                                       \
  do {                                                          \
    if (src >= src_end)                                         \
      {                                                         \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC;        \
        goto label_end_of_loop;                                 \
      }                                                         \
    c1 = *src++;                                                \
  } while (0)

#define TWO_MORE_BYTES(c1, c2)                                  \
  do {                                                          \
    if (src + 1 >= src_end)                                     \
      {                                                         \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC;        \
        goto label_end_of_loop;                                 \
      }                                                         \
    c1 = *src++;                                                \
    c2 = *src++;                                                \
  } while (0)
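
/* Illustration only, not part of coding.c.  A minimal sketch of the
   pattern behind the macros above: fetch a byte if one is available,
   otherwise record why the loop stopped and jump to a common exit
   label.  FETCH_BYTE, the FINISH_XXX values, and `consume_pairs' are
   hypothetical names for this sketch.  As in the real decoders, the
   variable `src_base' remembers where the current unit started, so
   that a unit cut off by the end of the source is not counted as
   consumed.  */

```c
#include <stddef.h>

/* Hypothetical result codes, standing in for CODING_FINISH_XXX.  */
enum { FINISH_NORMAL, FINISH_INSUFFICIENT_SRC };

#define FETCH_BYTE(c)                           \
  do {                                          \
    if (src >= src_end)                         \
      {                                         \
        *result = FINISH_INSUFFICIENT_SRC;      \
        goto label_end_of_loop;                 \
      }                                         \
    c = *src++;                                 \
  } while (0)

/* Consume LEN bytes at SRC as two-byte units.  Return the number of
   bytes that belong to complete units and store the reason for
   stopping in *RESULT.  */
size_t
consume_pairs (const unsigned char *src, size_t len, int *result)
{
  const unsigned char *start = src;
  const unsigned char *src_end = src + len;
  const unsigned char *src_base = src;
  int c1, c2;

  *result = FINISH_NORMAL;
  while ((src_base = src) < src_end)
    {
      FETCH_BYTE (c1);
      FETCH_BYTE (c2);
      (void) c1;                /* A real decoder would emit here.  */
      (void) c2;
    }
 label_end_of_loop:
  return (size_t) (src_base - start);
}
```

/* With five input bytes the sketch reports four bytes consumed and
   FINISH_INSUFFICIENT_SRC, so a caller can retry the trailing byte
   once more input has arrived.  */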
/* Like ONE_MORE_BYTE, but 8-bit bytes of data at SRC are in multibyte
   form if MULTIBYTEP is nonzero.  */

#define ONE_MORE_BYTE_CHECK_MULTIBYTE(c1, multibytep)           \
  do {                                                          \
    if (src >= src_end)                                         \
      {                                                         \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC;        \
        goto label_end_of_loop;                                 \
      }                                                         \
    c1 = *src++;                                                \
    if (multibytep && c1 == LEADING_CODE_8_BIT_CONTROL)         \
      c1 = *src++ - 0x20;                                       \
  } while (0)
/* Set C to the next character at the source text pointed by `src'.
   If there are not enough characters in the source, jump to
   `label_end_of_loop'.  The caller should set variables `coding',
   `src', `src_end', and `translation_table' to appropriate values
   in advance.  This macro is used in encoding routines
   `encode_coding_XXX', thus it assumes that the source text is in
   multibyte form except for 8-bit characters.  8-bit characters are
   in multibyte form if coding->src_multibyte is nonzero, else they
   are represented by a single byte.  */

#define ONE_MORE_CHAR(c)                                        \
  do {                                                          \
    int len = src_end - src;                                    \
    int bytes;                                                  \
    if (len <= 0)                                               \
      {                                                         \
        coding->result = CODING_FINISH_INSUFFICIENT_SRC;        \
        goto label_end_of_loop;                                 \
      }                                                         \
    if (coding->src_multibyte                                   \
        || UNIBYTE_STR_AS_MULTIBYTE_P (src, len, bytes))        \
      c = STRING_CHAR_AND_LENGTH (src, len, bytes);             \
    else                                                        \
      c = *src, bytes = 1;                                      \
    if (!NILP (translation_table))                              \
      c = translate_char (translation_table, c, -1, 0, 0);      \
    src += bytes;                                               \
  } while (0)
/* Produce a multibyte form of character C to `dst'.  Jump to
   `label_end_of_loop' if there's not enough space at `dst'.

   If we are now in the middle of a composition sequence, the decoded
   character may be ALTCHAR (for the current composition).  In that
   case, the character goes to coding->cmp_data->data instead of
   `dst'.

   This macro is used in decoding routines.  */

#define EMIT_CHAR(c)                                            \
  do {                                                          \
    if (! COMPOSING_P (coding)                                  \
        || coding->composing == COMPOSITION_RELATIVE            \
        || coding->composing == COMPOSITION_WITH_RULE)          \
      {                                                         \
        int bytes = CHAR_BYTES (c);                             \
        if ((dst + bytes) > (dst_bytes ? dst_end : src))        \
          {                                                     \
            coding->result = CODING_FINISH_INSUFFICIENT_DST;    \
            goto label_end_of_loop;                             \
          }                                                     \
        dst += CHAR_STRING (c, dst);                            \
        coding->produced_char++;                                \
      }                                                         \
                                                                \
    if (COMPOSING_P (coding)                                    \
        && coding->composing != COMPOSITION_RELATIVE)           \
      {                                                         \
        CODING_ADD_COMPOSITION_COMPONENT (coding, c);           \
        coding->composition_rule_follows                        \
          = coding->composing != COMPOSITION_WITH_ALTCHARS;     \
      }                                                         \
  } while (0)
#define EMIT_ONE_BYTE(c)                                        \
  do {                                                          \
    if (dst >= (dst_bytes ? dst_end : src))                     \
      {                                                         \
        coding->result = CODING_FINISH_INSUFFICIENT_DST;        \
        goto label_end_of_loop;                                 \
      }                                                         \
    *dst++ = c;                                                 \
  } while (0)

#define EMIT_TWO_BYTES(c1, c2)                                  \
  do {                                                          \
    if (dst + 2 > (dst_bytes ? dst_end : src))                  \
      {                                                         \
        coding->result = CODING_FINISH_INSUFFICIENT_DST;        \
        goto label_end_of_loop;                                 \
      }                                                         \
    *dst++ = c1, *dst++ = c2;                                   \
  } while (0)

#define EMIT_BYTES(from, to)                                    \
  do {                                                          \
    if (dst + (to - from) > (dst_bytes ? dst_end : src))        \
      {                                                         \
        coding->result = CODING_FINISH_INSUFFICIENT_DST;        \
        goto label_end_of_loop;                                 \
      }                                                         \
    while (from < to)                                           \
      *dst++ = *from++;                                         \
  } while (0)
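
/* Illustration only, not part of coding.c.  The EMIT_XXX macros
   above all compare `dst' against (dst_bytes ? dst_end : src).  A
   minimal sketch of what that limit means when DST_BYTES is zero,
   i.e. when conversion happens in place: output may grow only up to
   the first byte that has not been read yet.  The function
   `strip_cr_in_place' is a hypothetical example, not an Emacs
   routine.  */

```c
#include <stddef.h>

/* Remove every carriage-return from the LEN bytes at BUF, writing the
   result back into BUF.  Return the new length.  */
size_t
strip_cr_in_place (unsigned char *buf, size_t len)
{
  unsigned char *src = buf, *src_end = buf + len;
  unsigned char *dst = buf;
  int dst_bytes = 0;            /* Zero: source and destination overlap.  */
  unsigned char *dst_end = buf; /* Ignored while DST_BYTES is zero.  */

  while (src < src_end)
    {
      unsigned char c = *src++;

      if (c == '\r')
        continue;
      /* Same limit as EMIT_ONE_BYTE: never write at or beyond the
         not-yet-read part of the source.  */
      if (dst >= (dst_bytes ? dst_end : src))
        break;
      *dst++ = c;
    }
  return (size_t) (dst - buf);
}
```

/* Because each byte is read before it can be overwritten, the limit
   is never hit in this sketch.  A converter whose output can be
   longer than its input would stop at that test instead.  */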
331 /*** 1. Preamble ***/
333 #ifdef emacs
334 #include <config.h>
335 #endif
337 #include <stdio.h>
339 #ifdef emacs
341 #include "lisp.h"
342 #include "buffer.h"
343 #include "charset.h"
344 #include "composite.h"
345 #include "ccl.h"
346 #include "coding.h"
347 #include "window.h"
349 #else /* not emacs */
351 #include "mulelib.h"
353 #endif /* not emacs */
355 Lisp_Object Qcoding_system, Qeol_type;
356 Lisp_Object Qbuffer_file_coding_system;
357 Lisp_Object Qpost_read_conversion, Qpre_write_conversion;
358 Lisp_Object Qno_conversion, Qundecided;
359 Lisp_Object Qcoding_system_history;
360 Lisp_Object Qsafe_chars;
361 Lisp_Object Qvalid_codes;
363 extern Lisp_Object Qinsert_file_contents, Qwrite_region;
364 Lisp_Object Qcall_process, Qcall_process_region, Qprocess_argument;
365 Lisp_Object Qstart_process, Qopen_network_stream;
366 Lisp_Object Qtarget_idx;
368 Lisp_Object Vselect_safe_coding_system_function;
370 int coding_system_require_warning;
372 /* Mnemonic string for each format of end-of-line. */
373 Lisp_Object eol_mnemonic_unix, eol_mnemonic_dos, eol_mnemonic_mac;
374 /* Mnemonic string to indicate format of end-of-line is not yet
375 decided. */
376 Lisp_Object eol_mnemonic_undecided;
378 /* Format of end-of-line decided by system. This is CODING_EOL_LF on
379 Unix, CODING_EOL_CRLF on DOS/Windows, and CODING_EOL_CR on Mac. */
380 int system_eol_type;
382 #ifdef emacs
384 /* Information about which coding system is safe for which chars.
385 The value has the form (GENERIC-LIST . NON-GENERIC-ALIST).
387 GENERIC-LIST is a list of generic coding systems which can encode
388 any characters.
390 NON-GENERIC-ALIST is an alist of non generic coding systems vs the
391 corresponding char table that contains safe chars. */
392 Lisp_Object Vcoding_system_safe_chars;
394 Lisp_Object Vcoding_system_list, Vcoding_system_alist;
396 Lisp_Object Qcoding_system_p, Qcoding_system_error;
398 /* Coding system emacs-mule and raw-text are for converting only
399 end-of-line format. */
400 Lisp_Object Qemacs_mule, Qraw_text;
402 /* Coding-systems are handed between Emacs Lisp programs and C internal
403 routines by the following three variables. */
404 /* Coding-system for reading files and receiving data from process. */
405 Lisp_Object Vcoding_system_for_read;
406 /* Coding-system for writing files and sending data to process. */
407 Lisp_Object Vcoding_system_for_write;
408 /* Coding-system actually used in the latest I/O. */
409 Lisp_Object Vlast_coding_system_used;
411 /* A vector of length 256 which contains information about special
412 Latin codes (especially for dealing with Microsoft codes). */
413 Lisp_Object Vlatin_extra_code_table;
415 /* Flag to inhibit code conversion of end-of-line format. */
416 int inhibit_eol_conversion;
418 /* Flag to inhibit ISO2022 escape sequence detection. */
419 int inhibit_iso_escape_detection;
421 /* Flag to make buffer-file-coding-system inherit from process-coding. */
422 int inherit_process_coding_system;
424 /* Coding system to be used to encode text for terminal display. */
425 struct coding_system terminal_coding;
427 /* Coding system to be used to encode text for terminal display when
428 terminal coding system is nil. */
429 struct coding_system safe_terminal_coding;
431 /* Coding system of what is sent from terminal keyboard. */
432 struct coding_system keyboard_coding;
434 /* Default coding system to be used to write a file. */
435 struct coding_system default_buffer_file_coding;
437 Lisp_Object Vfile_coding_system_alist;
438 Lisp_Object Vprocess_coding_system_alist;
439 Lisp_Object Vnetwork_coding_system_alist;
441 Lisp_Object Vlocale_coding_system;
443 #endif /* emacs */
445 Lisp_Object Qcoding_category, Qcoding_category_index;
447 /* List of symbols `coding-category-xxx' ordered by priority. */
448 Lisp_Object Vcoding_category_list;
450 /* Table of coding categories (Lisp symbols). */
451 Lisp_Object Vcoding_category_table;
/* Table of names of symbol for each coding-category.  */
char *coding_category_name[CODING_CATEGORY_IDX_MAX] = {
  "coding-category-emacs-mule",
  "coding-category-sjis",
  "coding-category-iso-7",
  "coding-category-iso-7-tight",
  "coding-category-iso-8-1",
  "coding-category-iso-8-2",
  "coding-category-iso-7-else",
  "coding-category-iso-8-else",
  "coding-category-ccl",
  "coding-category-big5",
  "coding-category-utf-8",
  "coding-category-utf-16-be",
  "coding-category-utf-16-le",
  "coding-category-raw-text",
  "coding-category-binary"
};
472 /* Table of pointers to coding systems corresponding to each coding
473 categories. */
474 struct coding_system *coding_system_table[CODING_CATEGORY_IDX_MAX];
476 /* Table of coding category masks. Nth element is a mask for a coding
477 category of which priority is Nth. */
478 static
479 int coding_priorities[CODING_CATEGORY_IDX_MAX];
481 /* Flag to tell if we look up translation table on character code
482 conversion. */
483 Lisp_Object Venable_character_translation;
484 /* Standard translation table to look up on decoding (reading). */
485 Lisp_Object Vstandard_translation_table_for_decode;
486 /* Standard translation table to look up on encoding (writing). */
487 Lisp_Object Vstandard_translation_table_for_encode;
489 Lisp_Object Qtranslation_table;
490 Lisp_Object Qtranslation_table_id;
491 Lisp_Object Qtranslation_table_for_decode;
492 Lisp_Object Qtranslation_table_for_encode;
494 /* Alist of charsets vs revision number. */
495 Lisp_Object Vcharset_revision_alist;
497 /* Default coding systems used for process I/O. */
498 Lisp_Object Vdefault_process_coding_system;
500 /* Char table for translating Quail and self-inserting input. */
501 Lisp_Object Vtranslation_table_for_input;
503 /* Global flag to tell that we can't call post-read-conversion and
504 pre-write-conversion functions. Usually the value is zero, but it
505 is set to 1 temporarily while such functions are running. This is
506 to avoid infinite recursive call. */
507 static int inhibit_pre_post_conversion;
509 /* Char-table containing safe coding systems of each character. */
510 Lisp_Object Vchar_coding_system_table;
511 Lisp_Object Qchar_coding_system;
/* Return `safe-chars' property of CODING_SYSTEM (symbol).  Don't check
   its validity.  */

Lisp_Object
coding_safe_chars (coding_system)
     Lisp_Object coding_system;
{
  Lisp_Object coding_spec, plist, safe_chars;

  coding_spec = Fget (coding_system, Qcoding_system);
  plist = XVECTOR (coding_spec)->contents[3];
  safe_chars = Fplist_get (plist, Qsafe_chars);
  return (CHAR_TABLE_P (safe_chars) ? safe_chars : Qt);
}

#define CODING_SAFE_CHAR_P(safe_chars, c) \
  (EQ (safe_chars, Qt) || !NILP (CHAR_TABLE_REF (safe_chars, c)))
532 /*** 2. Emacs internal format (emacs-mule) handlers ***/
/* Emacs' internal format for representation of multiple character
   sets is a kind of multi-byte encoding, i.e. characters are
   represented by variable-length sequences of one-byte codes.

   ASCII characters and control characters (e.g. `tab', `newline') are
   represented by one-byte sequences which are their ASCII codes, in
   the range 0x00 through 0x7F.

   8-bit characters of the range 0x80..0x9F are represented by
   two-byte sequences of LEADING_CODE_8_BIT_CONTROL and (their 8-bit
   code + 0x20).

   8-bit characters of the range 0xA0..0xFF are represented by
   one-byte sequences which are their 8-bit code.

   The other characters are represented by a sequence of `base
   leading-code', optional `extended leading-code', and one or two
   `position-code's.  The length of the sequence is determined by the
   base leading-code.  Leading-code takes the range 0x81 through 0x9D,
   whereas extended leading-code and position-code take the range 0xA0
   through 0xFF.  See `charset.h' for more details about leading-code
   and position-code.

   --- CODE RANGE of Emacs' internal format ---
   character set        range
   -------------        -----
   ascii                0x00..0x7F
   eight-bit-control    LEADING_CODE_8_BIT_CONTROL + 0xA0..0xBF
   eight-bit-graphic    0xA0..0xFF
   ELSE                 0x81..0x9D + [0xA0..0xFF]+
   ---------------------------------------------

   As this is the internal character representation, the format is
   usually not used externally (i.e. in a file or in data sent to a
   process).  But it is possible to have text externally in this
   format (e.g. by encoding with the coding system `emacs-mule').

   In that case, a sequence of one-byte codes has a slightly different
   form.

   Firstly, all characters in eight-bit-control are represented by
   one-byte sequences which are their 8-bit code.

   Next, character composition data are represented by a byte
   sequence of the form: 0x80 METHOD BYTES CHARS COMPONENT ...,
   where,
        METHOD is 0xF0 plus one of the composition methods (enum
        composition_method),

        BYTES is 0xA0 plus the byte length of these composition data,

        CHARS is 0xA0 plus the number of characters composed by these
        data,

        COMPONENTs are characters in multibyte form or composition
        rules encoded by two bytes of ASCII codes.

   In addition, for backward compatibility, the following formats are
   also recognized as composition data on decoding.

   0x80 MSEQ ...
   0x80 0xFF MSEQ RULE MSEQ RULE ... MSEQ

   Here,
        MSEQ is a multibyte form but in this special format:
          ASCII: 0xA0 ASCII_CODE+0x80,
          other: LEADING_CODE+0x20 FOLLOWING-BYTE ...,
        RULE is a one-byte code of the range 0xA0..0xF0 that
        represents a composition rule.
  */
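
/* Illustration only, not part of coding.c.  A minimal sketch of the
   one-byte and two-byte forms described above for ASCII and raw 8-bit
   bytes.  The value 0x9E used for the leading code is an assumption
   of this sketch; the real constant LEADING_CODE_8_BIT_CONTROL is
   defined in charset.h.  The function names are hypothetical, and
   characters of other charsets (leading codes 0x81..0x9D) are not
   handled.  */

```c
#include <stddef.h>

/* Assumed value; see charset.h for the real definition.  */
#define SKETCH_LEADING_CODE_8_BIT_CONTROL 0x9E

/* Store the internal form of the raw byte B at P, which must have
   room for two bytes.  Return the number of bytes stored.  */
size_t
eight_bit_to_internal (unsigned char b, unsigned char *p)
{
  if (b >= 0x80 && b <= 0x9F)
    {
      /* eight-bit-control: leading code, then the code plus 0x20.  */
      p[0] = SKETCH_LEADING_CODE_8_BIT_CONTROL;
      p[1] = b + 0x20;
      return 2;
    }
  /* ASCII (0x00..0x7F) and eight-bit-graphic (0xA0..0xFF).  */
  p[0] = b;
  return 1;
}

/* Inverse of the above for the LEN bytes at P.  Store the raw byte in
   *B and return the number of bytes consumed, or 0 if LEN is zero.  */
size_t
internal_to_eight_bit (const unsigned char *p, size_t len, unsigned char *b)
{
  if (len >= 2 && p[0] == SKETCH_LEADING_CODE_8_BIT_CONTROL
      && p[1] >= 0xA0 && p[1] <= 0xBF)
    {
      *b = p[1] - 0x20;
      return 2;
    }
  if (len >= 1)
    {
      *b = p[0];
      return 1;
    }
  return 0;
}
```

/* The subtraction of 0x20 in the second function is the same step
   that ONE_MORE_BYTE_CHECK_MULTIBYTE performs after it sees
   LEADING_CODE_8_BIT_CONTROL.  */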
605 enum emacs_code_class_type emacs_code_class[256];
607 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
608 Check if a text is encoded in Emacs' internal format. If it is,
609 return CODING_CATEGORY_MASK_EMACS_MULE, else return 0. */
611 static int
612 detect_coding_emacs_mule (src, src_end, multibytep)
613 unsigned char *src, *src_end;
614 int multibytep;
616 unsigned char c;
617 int composing = 0;
618 /* Dummy for ONE_MORE_BYTE. */
619 struct coding_system dummy_coding;
620 struct coding_system *coding = &dummy_coding;
622 while (1)
624 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
626 if (composing)
628 if (c < 0xA0)
629 composing = 0;
630 else if (c == 0xA0)
632 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
633 c &= 0x7F;
635 else
636 c -= 0x20;
639 if (c < 0x20)
641 if (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO)
642 return 0;
644 else if (c >= 0x80 && c < 0xA0)
646 if (c == 0x80)
647 /* Old leading code for a composite character. */
648 composing = 1;
649 else
651 unsigned char *src_base = src - 1;
652 int bytes;
654 if (!UNIBYTE_STR_AS_MULTIBYTE_P (src_base, src_end - src_base,
655 bytes))
656 return 0;
657 src = src_base + bytes;
661 label_end_of_loop:
662 return CODING_CATEGORY_MASK_EMACS_MULE;
/* Record the starting position START and METHOD of one composition.  */

#define CODING_ADD_COMPOSITION_START(coding, start, method)     \
  do {                                                          \
    struct composition_data *cmp_data = coding->cmp_data;       \
    int *data = cmp_data->data + cmp_data->used;                \
    coding->cmp_data_start = cmp_data->used;                    \
    data[0] = -1;                                               \
    data[1] = cmp_data->char_offset + start;                    \
    data[3] = (int) method;                                     \
    cmp_data->used += 4;                                        \
  } while (0)

/* Record the ending position END of the current composition.  */

#define CODING_ADD_COMPOSITION_END(coding, end)                 \
  do {                                                          \
    struct composition_data *cmp_data = coding->cmp_data;       \
    int *data = cmp_data->data + coding->cmp_data_start;        \
    data[0] = cmp_data->used - coding->cmp_data_start;          \
    data[2] = cmp_data->char_offset + end;                      \
  } while (0)

/* Record one COMPONENT (alternate character or composition rule).  */

#define CODING_ADD_COMPOSITION_COMPONENT(coding, component)     \
  (coding->cmp_data->data[coding->cmp_data->used++] = component)
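
/* Illustration only, not part of coding.c.  A minimal sketch of the
   record layout that the three macros above build: data[0] is the
   record length, data[1] and data[2] are the start and end character
   positions, data[3] is the method, and the components follow.  The
   type `struct cmp_store' and the functions are hypothetical; the
   sketch leaves out the `char_offset' adjustment, the chaining of
   several data blocks, and any bounds checking.  */

```c
/* Hypothetical, simplified stand-in for struct composition_data.  */
struct cmp_store
{
  int data[64];
  int used;                     /* Number of elements in use.  */
  int start;                    /* Index of the record being built.  */
};

/* Begin a record for a composition starting at character FROM.  */
void
cmp_begin (struct cmp_store *s, int from, int method)
{
  int *d = s->data + s->used;

  s->start = s->used;
  d[0] = -1;                    /* Length; filled in by cmp_end.  */
  d[1] = from;
  d[3] = method;
  s->used += 4;
}

/* Append one component (a character or a rule) to the record.  */
void
cmp_add (struct cmp_store *s, int component)
{
  s->data[s->used++] = component;
}

/* Finish the record for a composition ending at character TO.  */
void
cmp_end (struct cmp_store *s, int to)
{
  int *d = s->data + s->start;

  d[0] = s->used - s->start;
  d[2] = to;
}
```

/* A reader can walk such records by adding data[0] to the current
   index, which is how ENCODE_COMPOSITION_EMACS_MULE advances
   coding->cmp_data_start.  */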
/* Get one byte from the data pointed to by SRC and increment SRC.  If
   SRC is not less than SRC_END, return -1 without incrementing SRC.  */

#define SAFE_ONE_MORE_BYTE() (src >= src_end ? -1 : *src++)
/* Decode a character represented as a component of composition
   sequence of Emacs 20 style at SRC.  Set C to that character, store
   its multibyte form sequence at P, and set P to the end of that
   sequence.  If no valid character is found, set C to -1.  */

#define DECODE_EMACS_MULE_COMPOSITION_CHAR(c, p)                \
  do {                                                          \
    int bytes;                                                  \
                                                                \
    c = SAFE_ONE_MORE_BYTE ();                                  \
    if (c < 0)                                                  \
      break;                                                    \
    if (CHAR_HEAD_P (c))                                        \
      c = -1;                                                   \
    else if (c == 0xA0)                                         \
      {                                                         \
        c = SAFE_ONE_MORE_BYTE ();                              \
        if (c < 0xA0)                                           \
          c = -1;                                               \
        else                                                    \
          {                                                     \
            /* ASCII is stored as ASCII_CODE+0x80 (see the      \
               notes in section 2), so undo that offset.  */    \
            c -= 0x80;                                          \
            *p++ = c;                                           \
          }                                                     \
      }                                                         \
    else if (BASE_LEADING_CODE_P (c - 0x20))                    \
      {                                                         \
        unsigned char *p0 = p;                                  \
                                                                \
        c -= 0x20;                                              \
        *p++ = c;                                               \
        bytes = BYTES_BY_CHAR_HEAD (c);                         \
        while (--bytes)                                         \
          {                                                     \
            c = SAFE_ONE_MORE_BYTE ();                          \
            if (c < 0)                                          \
              break;                                            \
            *p++ = c;                                           \
          }                                                     \
        if (UNIBYTE_STR_AS_MULTIBYTE_P (p0, p - p0, bytes))     \
          c = STRING_CHAR (p0, bytes);                          \
        else                                                    \
          c = -1;                                               \
      }                                                         \
    else                                                        \
      c = -1;                                                   \
  } while (0)
/* Decode a composition rule represented as a component of composition
   sequence of Emacs 20 style at SRC.  Set C to the rule.  If no
   valid rule is found, set C to -1.  The caller should declare the
   variables `gref' and `nref' in advance.  */

#define DECODE_EMACS_MULE_COMPOSITION_RULE(c)           \
  do {                                                  \
    c = SAFE_ONE_MORE_BYTE ();                          \
    c -= 0xA0;                                          \
    if (c < 0 || c >= 81)                               \
      c = -1;                                           \
    else                                                \
      {                                                 \
        gref = c / 9, nref = c % 9;                     \
        c = COMPOSITION_ENCODE_RULE (gref, nref);       \
      }                                                 \
  } while (0)
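
/* Illustration only, not part of coding.c.  A minimal sketch of the
   arithmetic in the macro above: an Emacs 20 style rule byte in the
   range 0xA0..0xF0 holds two reference points, each 0..8, packed as
   gref * 9 + nref.  The function `decode_rule_byte' is a hypothetical
   name; it stops before the final COMPOSITION_ENCODE_RULE step, whose
   definition lives in composite.h.  */

```c
/* Split the rule byte BYTE into *GREF and *NREF.  BYTE may be -1 to
   mean "no more input".  Return 0 on success, -1 if BYTE is not a
   valid rule.  */
int
decode_rule_byte (int byte, int *gref, int *nref)
{
  int c = byte - 0xA0;

  if (c < 0 || c >= 81)
    return -1;
  *gref = c / 9;
  *nref = c % 9;
  return 0;
}
```

/* For instance, the byte 0xA0 + 4 * 9 + 7 yields gref 4 and nref 7,
   and any byte above 0xF0 is rejected.  */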
/* Decode composition sequence encoded by `emacs-mule' at the source
   pointed by SRC.  SRC_END is the end of source.  Store information
   of the composition in CODING->cmp_data.

   For backward compatibility, decode also a composition sequence of
   Emacs 20 style.  In that case, the composition sequence contains
   characters that should be extracted into a buffer or string.  Store
   those characters at *DESTINATION in multibyte form.

   If we encounter an invalid byte sequence, return 0.
   If we encounter an insufficient source or destination, or
   insufficient space in CODING->cmp_data, return -1.
   Otherwise, return consumed bytes in the source.

*/
783 static INLINE int
784 decode_composition_emacs_mule (coding, src, src_end,
785 destination, dst_end, dst_bytes)
786 struct coding_system *coding;
787 unsigned char *src, *src_end, **destination, *dst_end;
788 int dst_bytes;
790 unsigned char *dst = *destination;
791 int method, data_len, nchars;
792 unsigned char *src_base = src++;
793 /* Store components of composition. */
794 int component[COMPOSITION_DATA_MAX_BUNCH_LENGTH];
795 int ncomponent;
796 /* Store multibyte form of characters to be composed. This is for
797 Emacs 20 style composition sequence. */
798 unsigned char buf[MAX_COMPOSITION_COMPONENTS * MAX_MULTIBYTE_LENGTH];
799 unsigned char *bufp = buf;
800 int c, i, gref, nref;
802 if (coding->cmp_data->used + COMPOSITION_DATA_MAX_BUNCH_LENGTH
803 >= COMPOSITION_DATA_SIZE)
805 coding->result = CODING_FINISH_INSUFFICIENT_CMP;
806 return -1;
809 ONE_MORE_BYTE (c);
810 if (c - 0xF0 >= COMPOSITION_RELATIVE
811 && c - 0xF0 <= COMPOSITION_WITH_RULE_ALTCHARS)
813 int with_rule;
815 method = c - 0xF0;
816 with_rule = (method == COMPOSITION_WITH_RULE
817 || method == COMPOSITION_WITH_RULE_ALTCHARS);
818 ONE_MORE_BYTE (c);
819 data_len = c - 0xA0;
820 if (data_len < 4
821 || src_base + data_len > src_end)
822 return 0;
823 ONE_MORE_BYTE (c);
824 nchars = c - 0xA0;
825 if (c < 1)
826 return 0;
827 for (ncomponent = 0; src < src_base + data_len; ncomponent++)
829 /* If it is longer than this, it can't be valid. */
830 if (ncomponent >= COMPOSITION_DATA_MAX_BUNCH_LENGTH)
831 return 0;
833 if (ncomponent % 2 && with_rule)
835 ONE_MORE_BYTE (gref);
836 gref -= 32;
837 ONE_MORE_BYTE (nref);
838 nref -= 32;
839 c = COMPOSITION_ENCODE_RULE (gref, nref);
841 else
843 int bytes;
844 if (UNIBYTE_STR_AS_MULTIBYTE_P (src, src_end - src, bytes))
845 c = STRING_CHAR (src, bytes);
846 else
847 c = *src, bytes = 1;
848 src += bytes;
850 component[ncomponent] = c;
853 else
855 /* This may be an old Emacs 20 style format. See the comment at
856 the section 2 of this file. */
857 while (src < src_end && !CHAR_HEAD_P (*src)) src++;
858 if (src == src_end
859 && !(coding->mode & CODING_MODE_LAST_BLOCK))
860 goto label_end_of_loop;
862 src_end = src;
863 src = src_base + 1;
864 if (c < 0xC0)
866 method = COMPOSITION_RELATIVE;
867 for (ncomponent = 0; ncomponent < MAX_COMPOSITION_COMPONENTS;)
869 DECODE_EMACS_MULE_COMPOSITION_CHAR (c, bufp);
870 if (c < 0)
871 break;
872 component[ncomponent++] = c;
874 if (ncomponent < 2)
875 return 0;
876 nchars = ncomponent;
878 else if (c == 0xFF)
880 method = COMPOSITION_WITH_RULE;
881 src++;
882 DECODE_EMACS_MULE_COMPOSITION_CHAR (c, bufp);
883 if (c < 0)
884 return 0;
885 component[0] = c;
886 for (ncomponent = 1;
887 ncomponent < MAX_COMPOSITION_COMPONENTS * 2 - 1;)
889 DECODE_EMACS_MULE_COMPOSITION_RULE (c);
890 if (c < 0)
891 break;
892 component[ncomponent++] = c;
893 DECODE_EMACS_MULE_COMPOSITION_CHAR (c, bufp);
894 if (c < 0)
895 break;
896 component[ncomponent++] = c;
898 if (ncomponent < 3)
899 return 0;
900 nchars = (ncomponent + 1) / 2;
902 else
903 return 0;
906 if (buf == bufp || dst + (bufp - buf) <= (dst_bytes ? dst_end : src))
908 CODING_ADD_COMPOSITION_START (coding, coding->produced_char, method);
909 for (i = 0; i < ncomponent; i++)
910 CODING_ADD_COMPOSITION_COMPONENT (coding, component[i]);
911 CODING_ADD_COMPOSITION_END (coding, coding->produced_char + nchars);
912 if (buf < bufp)
914 unsigned char *p = buf;
915 EMIT_BYTES (p, bufp);
916 *destination += bufp - buf;
917 coding->produced_char += nchars;
919 return (src - src_base);
921 label_end_of_loop:
922 return -1;
925 /* See the above "GENERAL NOTES on `decode_coding_XXX ()' functions". */
927 static void
928 decode_coding_emacs_mule (coding, source, destination, src_bytes, dst_bytes)
929 struct coding_system *coding;
930 unsigned char *source, *destination;
931 int src_bytes, dst_bytes;
933 unsigned char *src = source;
934 unsigned char *src_end = source + src_bytes;
935 unsigned char *dst = destination;
936 unsigned char *dst_end = destination + dst_bytes;
937 /* SRC_BASE remembers the start position in source in each loop.
938 The loop will be exited when there's not enough source code, or
939 when there's not enough destination area to produce a
940 character. */
941 unsigned char *src_base;
943 coding->produced_char = 0;
944 while ((src_base = src) < src_end)
946 unsigned char tmp[MAX_MULTIBYTE_LENGTH], *p;
947 int bytes;
949 if (*src == '\r')
951 int c = *src++;
953 if (coding->eol_type == CODING_EOL_CR)
954 c = '\n';
955 else if (coding->eol_type == CODING_EOL_CRLF)
957 ONE_MORE_BYTE (c);
958 if (c != '\n')
960 src--;
961 c = '\r';
964 *dst++ = c;
965 coding->produced_char++;
966 continue;
968 else if (*src == '\n')
970 if ((coding->eol_type == CODING_EOL_CR
971 || coding->eol_type == CODING_EOL_CRLF)
972 && coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL)
974 coding->result = CODING_FINISH_INCONSISTENT_EOL;
975 goto label_end_of_loop;
977 *dst++ = *src++;
978 coding->produced_char++;
979 continue;
981 else if (*src == 0x80 && coding->cmp_data)
983 /* Start of composition data. */
984 int consumed = decode_composition_emacs_mule (coding, src, src_end,
985 &dst, dst_end,
986 dst_bytes);
987 if (consumed < 0)
988 goto label_end_of_loop;
989 else if (consumed > 0)
991 src += consumed;
992 continue;
994 bytes = CHAR_STRING (*src, tmp);
995 p = tmp;
996 src++;
998 else if (UNIBYTE_STR_AS_MULTIBYTE_P (src, src_end - src, bytes))
1000 p = src;
1001 src += bytes;
1003 else
1005 bytes = CHAR_STRING (*src, tmp);
1006 p = tmp;
1007 src++;
1009 if (dst + bytes >= (dst_bytes ? dst_end : src))
1011 coding->result = CODING_FINISH_INSUFFICIENT_DST;
1012 break;
1014 while (bytes--) *dst++ = *p++;
1015 coding->produced_char++;
1017 label_end_of_loop:
1018 coding->consumed = coding->consumed_char = src_base - source;
1019 coding->produced = dst - destination;
1023 /* Encode composition data stored at DATA into a special byte sequence
1024 starting by 0x80. Update CODING->cmp_data_start and maybe
1025 CODING->cmp_data for the next call. */
1027 #define ENCODE_COMPOSITION_EMACS_MULE(coding, data) \
1028 do { \
1029 unsigned char buf[1024], *p0 = buf, *p; \
1030 int len = data[0]; \
1031 int i; \
1033 buf[0] = 0x80; \
1034 buf[1] = 0xF0 + data[3]; /* METHOD */ \
1035 buf[3] = 0xA0 + (data[2] - data[1]); /* COMPOSED-CHARS */ \
1036 p = buf + 4; \
1037 if (data[3] == COMPOSITION_WITH_RULE \
1038 || data[3] == COMPOSITION_WITH_RULE_ALTCHARS) \
1040 p += CHAR_STRING (data[4], p); \
1041 for (i = 5; i < len; i += 2) \
1043 int gref, nref; \
1044 COMPOSITION_DECODE_RULE (data[i], gref, nref); \
1045 *p++ = 0x20 + gref; \
1046 *p++ = 0x20 + nref; \
1047 p += CHAR_STRING (data[i + 1], p); \
1050 else \
1052 for (i = 4; i < len; i++) \
1053 p += CHAR_STRING (data[i], p); \
1055 buf[2] = 0xA0 + (p - buf); /* COMPONENTS-BYTES */ \
1057 if (dst + (p - buf) + 4 > (dst_bytes ? dst_end : src)) \
1059 coding->result = CODING_FINISH_INSUFFICIENT_DST; \
1060 goto label_end_of_loop; \
1062 while (p0 < p) \
1063 *dst++ = *p0++; \
1064 coding->cmp_data_start += data[0]; \
1065 if (coding->cmp_data_start == coding->cmp_data->used \
1066 && coding->cmp_data->next) \
1068 coding->cmp_data = coding->cmp_data->next; \
1069 coding->cmp_data_start = 0; \
1071 } while (0)
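/* Illustration only, not part of coding.c: a minimal sketch of the
   4-byte header that ENCODE_COMPOSITION_EMACS_MULE writes in front of
   the component characters.  The function name and its parameters
   are hypothetical; the constants 0x80, 0xF0 and 0xA0 and the byte
   order are taken from the macro above.  */

```c
#include <stddef.h>

/* Hypothetical helper.  METHOD is the composition method code,
   TOTAL_BYTES corresponds to (p - buf) in the macro, i.e. the header
   plus the encoded components, and NCHARS is the number of composed
   characters (data[2] - data[1]).  Returns the header length.  */
size_t
sketch_composition_header (unsigned char *buf, int method,
                           int total_bytes, int nchars)
{
  buf[0] = 0x80;                                  /* start of composition data */
  buf[1] = (unsigned char) (0xF0 + method);       /* METHOD */
  buf[2] = (unsigned char) (0xA0 + total_bytes);  /* COMPONENTS-BYTES */
  buf[3] = (unsigned char) (0xA0 + nchars);       /* COMPOSED-CHARS */
  return 4;
}
```

/* The components themselves follow the header, so a caller of this
   sketch would fill buf[2] only after encoding them, as the macro
   does.  */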
1074 static void encode_eol P_ ((struct coding_system *, const unsigned char *,
1075 unsigned char *, int, int));
1077 static void
1078 encode_coding_emacs_mule (coding, source, destination, src_bytes, dst_bytes)
1079 struct coding_system *coding;
1080 unsigned char *source, *destination;
1081 int src_bytes, dst_bytes;
1083 unsigned char *src = source;
1084 unsigned char *src_end = source + src_bytes;
1085 unsigned char *dst = destination;
1086 unsigned char *dst_end = destination + dst_bytes;
1087 unsigned char *src_base;
1088 int c;
1089 int char_offset;
1090 int *data;
1092 Lisp_Object translation_table;
1094 translation_table = Qnil;
1096 /* Optimization for the case that there's no composition. */
1097 if (!coding->cmp_data || coding->cmp_data->used == 0)
1099 encode_eol (coding, source, destination, src_bytes, dst_bytes);
1100 return;
1103 char_offset = coding->cmp_data->char_offset;
1104 data = coding->cmp_data->data + coding->cmp_data_start;
1105 while (1)
1107 src_base = src;
1109 /* If SRC starts a composition, encode the information about the
1110 composition in advance. */
1111 if (coding->cmp_data_start < coding->cmp_data->used
1112 && char_offset + coding->consumed_char == data[1])
1114 ENCODE_COMPOSITION_EMACS_MULE (coding, data);
1115 char_offset = coding->cmp_data->char_offset;
1116 data = coding->cmp_data->data + coding->cmp_data_start;
1119 ONE_MORE_CHAR (c);
1120 if (c == '\n' && (coding->eol_type == CODING_EOL_CRLF
1121 || coding->eol_type == CODING_EOL_CR))
1123 if (coding->eol_type == CODING_EOL_CRLF)
1124 EMIT_TWO_BYTES ('\r', c);
1125 else
1126 EMIT_ONE_BYTE ('\r');
1128 else if (SINGLE_BYTE_CHAR_P (c))
1129 EMIT_ONE_BYTE (c);
1130 else
1131 EMIT_BYTES (src_base, src);
1132 coding->consumed_char++;
1134 label_end_of_loop:
1135 coding->consumed = src_base - source;
1136 coding->produced = coding->produced_char = dst - destination;
1137 return;
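/* Illustration only, not part of coding.c: a minimal sketch of the
   end-of-line rewriting done in the loop above, where '\n' becomes
   CR LF or a lone CR depending on the EOL type.  The enum and the
   function are hypothetical stand-ins for CODING_EOL_XXX and the
   EMIT_ONE_BYTE / EMIT_TWO_BYTES macros, and the sketch ignores
   multibyte characters and composition data.  */

```c
#include <stddef.h>

enum sketch_eol { SKETCH_EOL_LF, SKETCH_EOL_CRLF, SKETCH_EOL_CR };

/* Copy SRC_LEN bytes from SRC to DST, rewriting each '\n' according
   to EOL.  DST must have room for 2 * SRC_LEN bytes.  Returns the
   number of bytes written.  */
size_t
sketch_encode_eol (const unsigned char *src, size_t src_len,
                   unsigned char *dst, enum sketch_eol eol)
{
  size_t n = 0;
  size_t i;

  for (i = 0; i < src_len; i++)
    {
      unsigned char c = src[i];

      if (c == '\n' && eol == SKETCH_EOL_CRLF)
        {
          dst[n++] = '\r';
          dst[n++] = '\n';
        }
      else if (c == '\n' && eol == SKETCH_EOL_CR)
        dst[n++] = '\r';
      else
        dst[n++] = c;
    }
  return n;
}
```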
1141 /*** 3. ISO2022 handlers ***/
/* The following note describes the coding system ISO2022 briefly.
   Since the intention of this note is to help understand the
   functions in this file, some parts are NOT ACCURATE or are OVERLY
   SIMPLIFIED.  For thorough understanding, please refer to the
   original document of ISO2022.  This is equivalent to the standard
   ECMA-35, obtainable from <URL:http://www.ecma.ch/> (*).

   ISO2022 provides many mechanisms to encode several character sets
   in 7-bit and 8-bit environments.  For 7-bit environments, all text
   is encoded using bytes less than 128.  This may make the encoded
   text a little bit longer, but the text passes more easily through
   several types of gateway, some of which strip off the MSB (Most
   Significant Bit).

   There are two kinds of character sets: control character sets and
   graphic character sets.  The former contain control characters such
   as `newline' and `escape' to provide control functions (control
   functions are also provided by escape sequences).  The latter
   contain graphic characters such as 'A' and '-'.  Emacs recognizes
   two control character sets and many graphic character sets.

   Graphic character sets are classified into one of the following
   four classes, according to the number of bytes (DIMENSION) and
   number of characters in one dimension (CHARS) of the set:
     - DIMENSION1_CHARS94
     - DIMENSION1_CHARS96
     - DIMENSION2_CHARS94
     - DIMENSION2_CHARS96

   In addition, each character set is assigned an identification tag,
   unique for each set, called the "final character" (denoted as <F>
   hereafter).  The <F> of each character set is decided by ECMA(*)
   when it is registered in ISO.  The code range of <F> is 0x30..0x7F
   (0x30..0x3F are for private use only).

   Note (*): ECMA = European Computer Manufacturers Association

   Here are examples of graphic character sets [NAME(<F>)]:
     o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ...
     o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ...
     o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ...
     o DIMENSION2_CHARS96 -- none for the moment

   A code area (1 byte=8 bits) is divided into 4 areas, C0, GL, C1, and GR.
     C0 [0x00..0x1F] -- control character plane 0
     GL [0x20..0x7F] -- graphic character plane 0
     C1 [0x80..0x9F] -- control character plane 1
     GR [0xA0..0xFF] -- graphic character plane 1

   A control character set is directly designated and invoked to C0 or
   C1 by an escape sequence.  The most common case is that:
     - ISO646's control character set is designated/invoked to C0, and
     - ISO6429's control character set is designated/invoked to C1,
   and usually these designations/invocations are omitted in encoded
   text.  In a 7-bit environment, only C0 can be used, and a control
   character for C1 is encoded by an appropriate escape sequence to
   fit into the environment.  All control characters for C1 are
   defined to have corresponding escape sequences.

   A graphic character set is at first designated to one of four
   graphic registers (G0 through G3), then these graphic registers are
   invoked to GL or GR.  These designations and invocations can be
   done independently.  The most common case is that G0 is invoked to
   GL, G1 is invoked to GR, and ASCII is designated to G0.  Usually
   these invocations and designations are omitted in encoded text.
   In a 7-bit environment, only GL can be used.

   When a graphic character set of CHARS94 is invoked to GL, codes
   0x20 and 0x7F of the GL area work as control characters SPACE and
   DEL respectively, and codes 0xA0 and 0xFF of the GR area should not
   be used.

   There are two ways of invocation: locking-shift and single-shift.
   With locking-shift, the invocation lasts until the next different
   invocation, whereas with single-shift, the invocation affects the
   following character only and doesn't affect the locking-shift
   state.  Invocations are done by the following control characters or
   escape sequences:

   ----------------------------------------------------------------------
   abbrev  function                  cntrl  escape seq  description
   ----------------------------------------------------------------------
   SI/LS0  (shift-in)                0x0F   none        invoke G0 into GL
   SO/LS1  (shift-out)               0x0E   none        invoke G1 into GL
   LS2     (locking-shift-2)         none   ESC 'n'     invoke G2 into GL
   LS3     (locking-shift-3)         none   ESC 'o'     invoke G3 into GL
   LS1R    (locking-shift-1 right)   none   ESC '~'     invoke G1 into GR (*)
   LS2R    (locking-shift-2 right)   none   ESC '}'     invoke G2 into GR (*)
   LS3R    (locking-shift-3 right)   none   ESC '|'     invoke G3 into GR (*)
   SS2     (single-shift-2)          0x8E   ESC 'N'     invoke G2 for one char
   SS3     (single-shift-3)          0x8F   ESC 'O'     invoke G3 for one char
   ----------------------------------------------------------------------
   (*) These are not used by any known coding system.

   Control characters for these functions are defined by macros
   ISO_CODE_XXX in `coding.h'.

   Designations are done by the following escape sequences:
   ----------------------------------------------------------------------
   escape sequence      description
   ----------------------------------------------------------------------
   ESC '(' <F>          designate DIMENSION1_CHARS94<F> to G0
   ESC ')' <F>          designate DIMENSION1_CHARS94<F> to G1
   ESC '*' <F>          designate DIMENSION1_CHARS94<F> to G2
   ESC '+' <F>          designate DIMENSION1_CHARS94<F> to G3
   ESC ',' <F>          designate DIMENSION1_CHARS96<F> to G0 (*)
   ESC '-' <F>          designate DIMENSION1_CHARS96<F> to G1
   ESC '.' <F>          designate DIMENSION1_CHARS96<F> to G2
   ESC '/' <F>          designate DIMENSION1_CHARS96<F> to G3
   ESC '$' '(' <F>      designate DIMENSION2_CHARS94<F> to G0 (**)
   ESC '$' ')' <F>      designate DIMENSION2_CHARS94<F> to G1
   ESC '$' '*' <F>      designate DIMENSION2_CHARS94<F> to G2
   ESC '$' '+' <F>      designate DIMENSION2_CHARS94<F> to G3
   ESC '$' ',' <F>      designate DIMENSION2_CHARS96<F> to G0 (*)
   ESC '$' '-' <F>      designate DIMENSION2_CHARS96<F> to G1
   ESC '$' '.' <F>      designate DIMENSION2_CHARS96<F> to G2
   ESC '$' '/' <F>      designate DIMENSION2_CHARS96<F> to G3
   ----------------------------------------------------------------------

   In this list, "DIMENSION1_CHARS94<F>" means a graphic character set
   of dimension 1, chars 94, and final character <F>, etc...

   Note (*): Although these designations are not allowed in ISO2022,
   Emacs accepts them on decoding, and produces them on encoding
   CHARS96 character sets in a coding system which is characterized as
   7-bit environment, non-locking-shift, and non-single-shift.

   Note (**): If <F> is '@', 'A', or 'B', the intermediate character
   '(' can be omitted.  We refer to this as "short-form" hereafter.

   Now you may notice that there are a lot of ways of encoding the
   same multilingual text in ISO2022.  Actually, there exist many
   coding systems such as Compound Text (used in X11's inter-client
   communication), ISO-2022-JP (used in Japanese Internet), ISO-2022-KR
   (used in Korean Internet), EUC (Extended UNIX Code, used in Asian
   localized platforms), and all of these are variants of ISO2022.

   In addition to the above, Emacs handles two more kinds of escape
   sequences: ISO6429's direction specification and Emacs' private
   sequence for specifying character composition.

   ISO6429's direction specification takes the following form:
     o CSI ']'      -- end of the current direction
     o CSI '0' ']'  -- end of the current direction
     o CSI '1' ']'  -- start of left-to-right text
     o CSI '2' ']'  -- start of right-to-left text
   The control character CSI (0x9B: control sequence introducer) is
   abbreviated to the escape sequence ESC '[' in a 7-bit environment.

   Character composition specification takes the following form:
     o ESC '0' -- start relative composition
     o ESC '1' -- end composition
     o ESC '2' -- start rule-base composition (*)
     o ESC '3' -- start relative composition with alternate chars (**)
     o ESC '4' -- start rule-base composition with alternate chars (**)
   Since these are not standard escape sequences of any ISO standard,
   the use of them with these meanings is restricted to Emacs only.

   (*) This form is used only in Emacs 20.5 and older versions,
   but the newer versions can safely decode it.
   (**) This form is used only in Emacs 21.1 and newer versions,
   and the older versions can't decode it.

   Here's a list of example usages of these composition escape
   sequences (categorized by `enum composition_method').

   COMPOSITION_RELATIVE:
	ESC 0 CHAR [ CHAR ] ESC 1
   COMPOSITION_WITH_RULE:
	ESC 2 CHAR [ RULE CHAR ] ESC 1
   COMPOSITION_WITH_ALTCHARS:
	ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1
   COMPOSITION_WITH_RULE_ALTCHARS:
	ESC 4 ALTCHAR [ RULE ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 */
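/* Illustration only, not part of coding.c: a minimal sketch of how
   the designation table in the note above maps (DIMENSION, CHARS,
   register, <F>) to an escape sequence.  The function name and
   signature are hypothetical.  The sketch always emits the
   short-form for DIMENSION2_CHARS94 to G0 when <F> is '@', 'A', or
   'B', although the note only says the '(' "can be" omitted.  */

```c
#include <stddef.h>

/* Write the designation sequence into BUF (at least 4 bytes) and
   return its length, or 0 if an argument is out of range.
   DIMENSION is 1 or 2, CHARS is 94 or 96, REG is 0..3 for G0..G3.  */
size_t
sketch_designation (unsigned char *buf, int dimension, int chars,
                    int reg, unsigned char final)
{
  size_t n = 0;

  if ((dimension != 1 && dimension != 2)
      || (chars != 94 && chars != 96)
      || reg < 0 || reg > 3)
    return 0;

  buf[n++] = 0x1B;              /* ESC */
  if (dimension == 2)
    buf[n++] = '$';
  /* Intermediate byte: '(' ')' '*' '+' for CHARS94 and
     ',' '-' '.' '/' for CHARS96, indexed by the register.  */
  if (!(dimension == 2 && chars == 94 && reg == 0
        && final >= '@' && final <= 'B'))
    buf[n++] = (unsigned char) ((chars == 94 ? '(' : ',') + reg);
  buf[n++] = final;
  return n;
}
```

/* For example, ASCII to G0 comes out as ESC '(' 'B', and JISX0208 to
   G0 as the short-form ESC '$' 'B'.  */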
1318 enum iso_code_class_type iso_code_class[256];
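/* Illustration only, not part of coding.c: a minimal sketch of the
   four code areas C0, GL, C1 and GR described in the note above.
   The enum and function names are hypothetical.  The real
   iso_code_class table is finer-grained (it has separate classes
   such as ISO_0x20_or_0x7F and ISO_shift_out) and is filled in
   elsewhere in this file.  */

```c
enum sketch_area { SKETCH_C0, SKETCH_GL, SKETCH_C1, SKETCH_GR };

/* Return the code area that BYTE falls into.  */
enum sketch_area
sketch_code_area (unsigned char byte)
{
  if (byte < 0x20)
    return SKETCH_C0;   /* 0x00..0x1F: control character plane 0 */
  if (byte < 0x80)
    return SKETCH_GL;   /* 0x20..0x7F: graphic character plane 0 */
  if (byte < 0xA0)
    return SKETCH_C1;   /* 0x80..0x9F: control character plane 1 */
  return SKETCH_GR;     /* 0xA0..0xFF: graphic character plane 1 */
}
```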
1320 #define CHARSET_OK(idx, charset, c) \
1321 (coding_system_table[idx] \
1322 && (charset == CHARSET_ASCII \
1323 || (safe_chars = coding_safe_chars (coding_system_table[idx]->symbol), \
1324 CODING_SAFE_CHAR_P (safe_chars, c))) \
1325 && (CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding_system_table[idx], \
1326 charset) \
1327 != CODING_SPEC_ISO_NO_REQUESTED_DESIGNATION))
1329 #define SHIFT_OUT_OK(idx) \
1330 (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding_system_table[idx], 1) >= 0)
/* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
   Check if a text is encoded in ISO2022.  If it is, return an
   integer in which the appropriate flag bits are set, any of:
	CODING_CATEGORY_MASK_ISO_7
	CODING_CATEGORY_MASK_ISO_7_TIGHT
	CODING_CATEGORY_MASK_ISO_8_1
	CODING_CATEGORY_MASK_ISO_8_2
	CODING_CATEGORY_MASK_ISO_7_ELSE
	CODING_CATEGORY_MASK_ISO_8_ELSE
   If a code which should never appear in ISO2022 is found, return
   0.  */
1344 static int
1345 detect_coding_iso2022 (src, src_end, multibytep)
1346 unsigned char *src, *src_end;
1347 int multibytep;
1349 int mask = CODING_CATEGORY_MASK_ISO;
1350 int mask_found = 0;
1351 int reg[4], shift_out = 0, single_shifting = 0;
1352 int c, c1, charset;
1353 /* Dummy for ONE_MORE_BYTE. */
1354 struct coding_system dummy_coding;
1355 struct coding_system *coding = &dummy_coding;
1356 Lisp_Object safe_chars;
1358 reg[0] = CHARSET_ASCII, reg[1] = reg[2] = reg[3] = -1;
1359 while (mask && src < src_end)
1361 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
1362 retry:
1363 switch (c)
1365 case ISO_CODE_ESC:
1366 if (inhibit_iso_escape_detection)
1367 break;
1368 single_shifting = 0;
1369 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
1370 if (c >= '(' && c <= '/')
1372 /* Designation sequence for a charset of dimension 1. */
1373 ONE_MORE_BYTE_CHECK_MULTIBYTE (c1, multibytep);
1374 if (c1 < ' ' || c1 >= 0x80
1375 || (charset = iso_charset_table[0][c >= ','][c1]) < 0)
1376 /* Invalid designation sequence. Just ignore. */
1377 break;
1378 reg[(c - '(') % 4] = charset;
1380 else if (c == '$')
1382 /* Designation sequence for a charset of dimension 2. */
1383 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
1384 if (c >= '@' && c <= 'B')
1385 /* Designation for JISX0208.1978, GB2312, or JISX0208. */
1386 reg[0] = charset = iso_charset_table[1][0][c];
1387 else if (c >= '(' && c <= '/')
1389 ONE_MORE_BYTE_CHECK_MULTIBYTE (c1, multibytep);
1390 if (c1 < ' ' || c1 >= 0x80
1391 || (charset = iso_charset_table[1][c >= ','][c1]) < 0)
1392 /* Invalid designation sequence. Just ignore. */
1393 break;
1394 reg[(c - '(') % 4] = charset;
1396 else
1397 /* Invalid designation sequence. Just ignore. */
1398 break;
1400 else if (c == 'N' || c == 'O')
1402 /* ESC <Fe> for SS2 or SS3. */
1403 mask &= CODING_CATEGORY_MASK_ISO_7_ELSE;
1404 break;
1406 else if (c >= '0' && c <= '4')
1408 /* ESC <Fp> for start/end composition. */
1409 mask_found |= CODING_CATEGORY_MASK_ISO;
1410 break;
1412 else
1413 /* Invalid escape sequence. Just ignore. */
1414 break;
1416 /* We found a valid designation sequence for CHARSET. */
1417 mask &= ~CODING_CATEGORY_MASK_ISO_8BIT;
1418 c = MAKE_CHAR (charset, 0, 0);
1419 if (CHARSET_OK (CODING_CATEGORY_IDX_ISO_7, charset, c))
1420 mask_found |= CODING_CATEGORY_MASK_ISO_7;
1421 else
1422 mask &= ~CODING_CATEGORY_MASK_ISO_7;
1423 if (CHARSET_OK (CODING_CATEGORY_IDX_ISO_7_TIGHT, charset, c))
1424 mask_found |= CODING_CATEGORY_MASK_ISO_7_TIGHT;
1425 else
1426 mask &= ~CODING_CATEGORY_MASK_ISO_7_TIGHT;
1427 if (CHARSET_OK (CODING_CATEGORY_IDX_ISO_7_ELSE, charset, c))
1428 mask_found |= CODING_CATEGORY_MASK_ISO_7_ELSE;
1429 else
1430 mask &= ~CODING_CATEGORY_MASK_ISO_7_ELSE;
1431 if (CHARSET_OK (CODING_CATEGORY_IDX_ISO_8_ELSE, charset, c))
1432 mask_found |= CODING_CATEGORY_MASK_ISO_8_ELSE;
1433 else
1434 mask &= ~CODING_CATEGORY_MASK_ISO_8_ELSE;
1435 break;
1437 case ISO_CODE_SO:
1438 if (inhibit_iso_escape_detection)
1439 break;
1440 single_shifting = 0;
1441 if (shift_out == 0
1442 && (reg[1] >= 0
1443 || SHIFT_OUT_OK (CODING_CATEGORY_IDX_ISO_7_ELSE)
1444 || SHIFT_OUT_OK (CODING_CATEGORY_IDX_ISO_8_ELSE)))
1446 /* Locking shift out. */
1447 mask &= ~CODING_CATEGORY_MASK_ISO_7BIT;
1448 mask_found |= CODING_CATEGORY_MASK_ISO_SHIFT;
1450 break;
1452 case ISO_CODE_SI:
1453 if (inhibit_iso_escape_detection)
1454 break;
1455 single_shifting = 0;
1456 if (shift_out == 1)
1458 /* Locking shift in. */
1459 mask &= ~CODING_CATEGORY_MASK_ISO_7BIT;
1460 mask_found |= CODING_CATEGORY_MASK_ISO_SHIFT;
1462 break;
1464 case ISO_CODE_CSI:
1465 single_shifting = 0;
1466 case ISO_CODE_SS2:
1467 case ISO_CODE_SS3:
1469 int newmask = CODING_CATEGORY_MASK_ISO_8_ELSE;
1471 if (inhibit_iso_escape_detection)
1472 break;
1473 if (c != ISO_CODE_CSI)
1475 if (coding_system_table[CODING_CATEGORY_IDX_ISO_8_1]->flags
1476 & CODING_FLAG_ISO_SINGLE_SHIFT)
1477 newmask |= CODING_CATEGORY_MASK_ISO_8_1;
1478 if (coding_system_table[CODING_CATEGORY_IDX_ISO_8_2]->flags
1479 & CODING_FLAG_ISO_SINGLE_SHIFT)
1480 newmask |= CODING_CATEGORY_MASK_ISO_8_2;
1481 single_shifting = 1;
1483 if (VECTORP (Vlatin_extra_code_table)
1484 && !NILP (XVECTOR (Vlatin_extra_code_table)->contents[c]))
1486 if (coding_system_table[CODING_CATEGORY_IDX_ISO_8_1]->flags
1487 & CODING_FLAG_ISO_LATIN_EXTRA)
1488 newmask |= CODING_CATEGORY_MASK_ISO_8_1;
1489 if (coding_system_table[CODING_CATEGORY_IDX_ISO_8_2]->flags
1490 & CODING_FLAG_ISO_LATIN_EXTRA)
1491 newmask |= CODING_CATEGORY_MASK_ISO_8_2;
1493 mask &= newmask;
1494 mask_found |= newmask;
1496 break;
1498 default:
1499 if (c < 0x80)
1501 single_shifting = 0;
1502 break;
1504 else if (c < 0xA0)
1506 single_shifting = 0;
1507 if (VECTORP (Vlatin_extra_code_table)
1508 && !NILP (XVECTOR (Vlatin_extra_code_table)->contents[c]))
1510 int newmask = 0;
1512 if (coding_system_table[CODING_CATEGORY_IDX_ISO_8_1]->flags
1513 & CODING_FLAG_ISO_LATIN_EXTRA)
1514 newmask |= CODING_CATEGORY_MASK_ISO_8_1;
1515 if (coding_system_table[CODING_CATEGORY_IDX_ISO_8_2]->flags
1516 & CODING_FLAG_ISO_LATIN_EXTRA)
1517 newmask |= CODING_CATEGORY_MASK_ISO_8_2;
1518 mask &= newmask;
1519 mask_found |= newmask;
1521 else
1522 return 0;
1524 else
1526 mask &= ~(CODING_CATEGORY_MASK_ISO_7BIT
1527 | CODING_CATEGORY_MASK_ISO_7_ELSE);
1528 mask_found |= CODING_CATEGORY_MASK_ISO_8_1;
	      /* Check the length of succeeding codes of the range
		 0xA0..0xFF.  If the byte length is odd, we exclude
		 CODING_CATEGORY_MASK_ISO_8_2.  We can check this only
		 when we are not single shifting.  */
1533 if (!single_shifting
1534 && mask & CODING_CATEGORY_MASK_ISO_8_2)
1536 int i = 1;
1538 c = -1;
1539 while (src < src_end)
1541 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
1542 if (c < 0xA0)
1543 break;
1544 i++;
1547 if (i & 1 && src < src_end)
1548 mask &= ~CODING_CATEGORY_MASK_ISO_8_2;
1549 else
1550 mask_found |= CODING_CATEGORY_MASK_ISO_8_2;
1551 if (c >= 0)
1552 /* This means that we have read one extra byte. */
1553 goto retry;
1556 break;
1559 label_end_of_loop:
1560 return (mask & mask_found);
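/* Illustration only, not part of coding.c: a minimal sketch of the
   parity heuristic used above to rule out
   CODING_CATEGORY_MASK_ISO_8_2, which expects bytes in 0xA0..0xFF to
   come in pairs.  The function name is hypothetical, and the sketch
   simplifies the original: it ignores single shifts, and a run that
   is cut off by the end of the buffer is not judged at all.  */

```c
#include <stddef.h>

/* Return 1 if every run of bytes >= 0xA0 that is terminated inside
   SRC[0..LEN) has an even length, else 0.  */
int
sketch_gr_runs_even (const unsigned char *src, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      size_t run = 0;

      while (i < len && src[i] >= 0xA0)
        {
          run++;
          i++;
        }
      if ((run & 1) && i < len)
        return 0;       /* odd run ended by a byte below 0xA0 */
      if (i < len)
        i++;            /* skip the byte that ended the run */
    }
  return 1;
}
```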
/* Decode a character whose charset is CHARSET, whose 1st position
   code is C1, and whose 2nd position code is C2, and return the
   decoded character code.  If the variable `translation_table' is
   non-nil, return the translated code.  */
1568 #define DECODE_ISO_CHARACTER(charset, c1, c2) \
1569 (NILP (translation_table) \
1570 ? MAKE_CHAR (charset, c1, c2) \
1571 : translate_char (translation_table, -1, charset, c1, c2))
1573 /* Set designation state into CODING. */
1574 #define DECODE_DESIGNATION(reg, dimension, chars, final_char) \
1575 do { \
1576 int charset, c; \
1578 if (final_char < '0' || final_char >= 128) \
1579 goto label_invalid_code; \
1580 charset = ISO_CHARSET_TABLE (make_number (dimension), \
1581 make_number (chars), \
1582 make_number (final_char)); \
1583 c = MAKE_CHAR (charset, 0, 0); \
1584 if (charset >= 0 \
1585 && (CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset) == reg \
1586 || CODING_SAFE_CHAR_P (safe_chars, c))) \
1588 if (coding->spec.iso2022.last_invalid_designation_register == 0 \
1589 && reg == 0 \
1590 && charset == CHARSET_ASCII) \
1592 /* We should insert this designation sequence as is so \
1593 that it is surely written back to a file. */ \
1594 coding->spec.iso2022.last_invalid_designation_register = -1; \
1595 goto label_invalid_code; \
1597 coding->spec.iso2022.last_invalid_designation_register = -1; \
1598 if ((coding->mode & CODING_MODE_DIRECTION) \
1599 && CHARSET_REVERSE_CHARSET (charset) >= 0) \
1600 charset = CHARSET_REVERSE_CHARSET (charset); \
1601 CODING_SPEC_ISO_DESIGNATION (coding, reg) = charset; \
1603 else \
1605 coding->spec.iso2022.last_invalid_designation_register = reg; \
1606 goto label_invalid_code; \
1608 } while (0)
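/* Illustration only, not part of coding.c: a minimal sketch of how
   the intermediate byte of a designation sequence splits into a
   graphic register and a CHARS value, the same arithmetic that the
   detector uses in `reg[(c - '(') % 4]' and `c >= ','', and that the
   decoder uses when it passes `c1 - 0x28' or `c1 - 0x2C' to
   DECODE_DESIGNATION.  The function name is hypothetical.  */

```c
/* Store the register (0..3) in *REG and 94 or 96 in *CHARS.
   Return 1 on success, 0 if C is not in '('..'/'.  */
int
sketch_parse_intermediate (int c, int *reg, int *chars)
{
  if (c < 0x28 || c > 0x2F)
    return 0;
  *reg = (c - 0x28) % 4;           /* '(' and ',' both select G0 */
  *chars = c >= 0x2C ? 96 : 94;    /* ',' .. '/' designate CHARS96 */
  return 1;
}
```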
1610 /* Allocate a memory block for storing information about compositions.
1611 The block is chained to the already allocated blocks. */
1613 void
1614 coding_allocate_composition_data (coding, char_offset)
1615 struct coding_system *coding;
1616 int char_offset;
1618 struct composition_data *cmp_data
1619 = (struct composition_data *) xmalloc (sizeof *cmp_data);
1621 cmp_data->char_offset = char_offset;
1622 cmp_data->used = 0;
1623 cmp_data->prev = coding->cmp_data;
1624 cmp_data->next = NULL;
1625 if (coding->cmp_data)
1626 coding->cmp_data->next = cmp_data;
1627 coding->cmp_data = cmp_data;
1628 coding->cmp_data_start = 0;
/* Handle composition start sequence ESC 0, ESC 2, ESC 3, or ESC 4.
   ESC 0 : relative composition : ESC 0 CHAR ... ESC 1
   ESC 2 : rulebase composition : ESC 2 CHAR RULE CHAR RULE ... CHAR ESC 1
   ESC 3 : altchar composition :  ESC 3 ALT ... ESC 0 CHAR ... ESC 1
   ESC 4 : alt&rule composition : ESC 4 ALT RULE ... ALT ESC 0 CHAR ... ESC 1
  */
1638 #define DECODE_COMPOSITION_START(c1) \
1639 do { \
1640 if (coding->composing == COMPOSITION_DISABLED) \
1642 *dst++ = ISO_CODE_ESC; \
1643 *dst++ = c1 & 0x7f; \
1644 coding->produced_char += 2; \
1646 else if (!COMPOSING_P (coding)) \
1648 /* This is surely the start of a composition. We must be sure \
1649 that coding->cmp_data has enough space to store the \
1650 information about the composition. If not, terminate the \
1651 current decoding loop, allocate one more memory block for \
1652 coding->cmp_data in the caller, then start the decoding \
1653 loop again. We can't allocate memory here directly because \
1654 it may cause buffer/string relocation. */ \
1655 if (!coding->cmp_data \
1656 || (coding->cmp_data->used + COMPOSITION_DATA_MAX_BUNCH_LENGTH \
1657 >= COMPOSITION_DATA_SIZE)) \
1659 coding->result = CODING_FINISH_INSUFFICIENT_CMP; \
1660 goto label_end_of_loop; \
1662 coding->composing = (c1 == '0' ? COMPOSITION_RELATIVE \
1663 : c1 == '2' ? COMPOSITION_WITH_RULE \
1664 : c1 == '3' ? COMPOSITION_WITH_ALTCHARS \
1665 : COMPOSITION_WITH_RULE_ALTCHARS); \
1666 CODING_ADD_COMPOSITION_START (coding, coding->produced_char, \
1667 coding->composing); \
1668 coding->composition_rule_follows = 0; \
1670 else \
	/* We are already handling a composition.  If the method is	\
	   one of the following two, the codes following the current	\
	   escape sequence are actual characters stored in a buffer.  */ \
1675 if (coding->composing == COMPOSITION_WITH_ALTCHARS \
1676 || coding->composing == COMPOSITION_WITH_RULE_ALTCHARS) \
1678 coding->composing = COMPOSITION_RELATIVE; \
1679 coding->composition_rule_follows = 0; \
1682 } while (0)
1684 /* Handle composition end sequence ESC 1. */
1686 #define DECODE_COMPOSITION_END(c1) \
1687 do { \
1688 if (! COMPOSING_P (coding)) \
1690 *dst++ = ISO_CODE_ESC; \
1691 *dst++ = c1; \
1692 coding->produced_char += 2; \
1694 else \
1696 CODING_ADD_COMPOSITION_END (coding, coding->produced_char); \
1697 coding->composing = COMPOSITION_NO; \
1699 } while (0)
1701 /* Decode a composition rule from the byte C1 (and maybe one more byte
1702 from SRC) and store one encoded composition rule in
1703 coding->cmp_data. */
1705 #define DECODE_COMPOSITION_RULE(c1) \
1706 do { \
1707 int rule = 0; \
1708 (c1) -= 32; \
1709 if (c1 < 81) /* old format (before ver.21) */ \
1711 int gref = (c1) / 9; \
1712 int nref = (c1) % 9; \
1713 if (gref == 4) gref = 10; \
1714 if (nref == 4) nref = 10; \
1715 rule = COMPOSITION_ENCODE_RULE (gref, nref); \
1717 else if (c1 < 93) /* new format (after ver.21) */ \
1719 ONE_MORE_BYTE (c2); \
1720 rule = COMPOSITION_ENCODE_RULE (c1 - 81, c2 - 32); \
1722 CODING_ADD_COMPOSITION_COMPONENT (coding, rule); \
1723 coding->composition_rule_follows = 0; \
1724 } while (0)
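/* Illustration only, not part of coding.c: a minimal sketch of the
   two rule formats that DECODE_COMPOSITION_RULE distinguishes.  The
   function name and its return convention are hypothetical; it
   returns the raw GREF/NREF pair instead of calling
   COMPOSITION_ENCODE_RULE, and it rejects out-of-range bytes where
   the macro stores a rule of 0.  */

```c
/* C1 is the first rule byte and C2 the byte after it (used only by
   the two-byte format).  Store the references in *GREF and *NREF and
   return the number of bytes consumed (1 or 2), or 0 if C1 is not a
   rule byte.  */
int
sketch_decode_rule (int c1, int c2, int *gref, int *nref)
{
  c1 -= 32;
  if (c1 < 0)
    return 0;
  if (c1 < 81)                  /* old one-byte format */
    {
      *gref = c1 / 9;
      *nref = c1 % 9;
      if (*gref == 4)
        *gref = 10;
      if (*nref == 4)
        *nref = 10;
      return 1;
    }
  if (c1 < 93)                  /* new two-byte format */
    {
      *gref = c1 - 81;
      *nref = c2 - 32;
      return 2;
    }
  return 0;
}
```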
1727 /* See the above "GENERAL NOTES on `decode_coding_XXX ()' functions". */
1729 static void
1730 decode_coding_iso2022 (coding, source, destination, src_bytes, dst_bytes)
1731 struct coding_system *coding;
1732 unsigned char *source, *destination;
1733 int src_bytes, dst_bytes;
1735 unsigned char *src = source;
1736 unsigned char *src_end = source + src_bytes;
1737 unsigned char *dst = destination;
1738 unsigned char *dst_end = destination + dst_bytes;
1739 /* Charsets invoked to graphic plane 0 and 1 respectively. */
1740 int charset0 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 0);
1741 int charset1 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 1);
1742 /* SRC_BASE remembers the start position in source in each loop.
1743 The loop will be exited when there's not enough source code
1744 (within macro ONE_MORE_BYTE), or when there's not enough
1745 destination area to produce a character (within macro
1746 EMIT_CHAR). */
1747 unsigned char *src_base;
1748 int c, charset;
1749 Lisp_Object translation_table;
1750 Lisp_Object safe_chars;
1752 safe_chars = coding_safe_chars (coding->symbol);
1754 if (NILP (Venable_character_translation))
1755 translation_table = Qnil;
1756 else
1758 translation_table = coding->translation_table_for_decode;
1759 if (NILP (translation_table))
1760 translation_table = Vstandard_translation_table_for_decode;
1763 coding->result = CODING_FINISH_NORMAL;
1765 while (1)
1767 int c1, c2;
1769 src_base = src;
1770 ONE_MORE_BYTE (c1);
1772 /* We produce no character or one character. */
1773 switch (iso_code_class [c1])
1775 case ISO_0x20_or_0x7F:
1776 if (COMPOSING_P (coding) && coding->composition_rule_follows)
1778 DECODE_COMPOSITION_RULE (c1);
1779 continue;
1781 if (charset0 < 0 || CHARSET_CHARS (charset0) == 94)
1783 /* This is SPACE or DEL. */
1784 charset = CHARSET_ASCII;
1785 break;
	  /* This is a graphic character, so we fall through ... */
1789 case ISO_graphic_plane_0:
1790 if (COMPOSING_P (coding) && coding->composition_rule_follows)
1792 DECODE_COMPOSITION_RULE (c1);
1793 continue;
1795 charset = charset0;
1796 break;
1798 case ISO_0xA0_or_0xFF:
1799 if (charset1 < 0 || CHARSET_CHARS (charset1) == 94
1800 || coding->flags & CODING_FLAG_ISO_SEVEN_BITS)
1801 goto label_invalid_code;
	  /* This is a graphic character, so we fall through ... */
1804 case ISO_graphic_plane_1:
1805 if (charset1 < 0)
1806 goto label_invalid_code;
1807 charset = charset1;
1808 break;
1810 case ISO_control_0:
1811 if (COMPOSING_P (coding))
1812 DECODE_COMPOSITION_END ('1');
1814 /* All ISO2022 control characters in this class have the
1815 same representation in Emacs internal format. */
1816 if (c1 == '\n'
1817 && (coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL)
1818 && (coding->eol_type == CODING_EOL_CR
1819 || coding->eol_type == CODING_EOL_CRLF))
1821 coding->result = CODING_FINISH_INCONSISTENT_EOL;
1822 goto label_end_of_loop;
1824 charset = CHARSET_ASCII;
1825 break;
1827 case ISO_control_1:
1828 if (COMPOSING_P (coding))
1829 DECODE_COMPOSITION_END ('1');
1830 goto label_invalid_code;
1832 case ISO_carriage_return:
1833 if (COMPOSING_P (coding))
1834 DECODE_COMPOSITION_END ('1');
1836 if (coding->eol_type == CODING_EOL_CR)
1837 c1 = '\n';
1838 else if (coding->eol_type == CODING_EOL_CRLF)
1840 ONE_MORE_BYTE (c1);
1841 if (c1 != ISO_CODE_LF)
1843 src--;
1844 c1 = '\r';
1847 charset = CHARSET_ASCII;
1848 break;
1850 case ISO_shift_out:
1851 if (! (coding->flags & CODING_FLAG_ISO_LOCKING_SHIFT)
1852 || CODING_SPEC_ISO_DESIGNATION (coding, 1) < 0)
1853 goto label_invalid_code;
1854 CODING_SPEC_ISO_INVOCATION (coding, 0) = 1;
1855 charset0 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 0);
1856 continue;
1858 case ISO_shift_in:
1859 if (! (coding->flags & CODING_FLAG_ISO_LOCKING_SHIFT))
1860 goto label_invalid_code;
1861 CODING_SPEC_ISO_INVOCATION (coding, 0) = 0;
1862 charset0 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 0);
1863 continue;
1865 case ISO_single_shift_2_7:
1866 case ISO_single_shift_2:
1867 if (! (coding->flags & CODING_FLAG_ISO_SINGLE_SHIFT))
1868 goto label_invalid_code;
1869 /* SS2 is handled as an escape sequence of ESC 'N' */
1870 c1 = 'N';
1871 goto label_escape_sequence;
1873 case ISO_single_shift_3:
1874 if (! (coding->flags & CODING_FLAG_ISO_SINGLE_SHIFT))
1875 goto label_invalid_code;
	  /* SS3 is handled as an escape sequence of ESC 'O' */
1877 c1 = 'O';
1878 goto label_escape_sequence;
1880 case ISO_control_sequence_introducer:
1881 /* CSI is handled as an escape sequence of ESC '[' ... */
1882 c1 = '[';
1883 goto label_escape_sequence;
1885 case ISO_escape:
1886 ONE_MORE_BYTE (c1);
1887 label_escape_sequence:
1888 /* Escape sequences handled by Emacs are invocation,
1889 designation, direction specification, and character
1890 composition specification. */
1891 switch (c1)
1893 case '&': /* revision of following character set */
1894 ONE_MORE_BYTE (c1);
1895 if (!(c1 >= '@' && c1 <= '~'))
1896 goto label_invalid_code;
1897 ONE_MORE_BYTE (c1);
1898 if (c1 != ISO_CODE_ESC)
1899 goto label_invalid_code;
1900 ONE_MORE_BYTE (c1);
1901 goto label_escape_sequence;
1903 case '$': /* designation of 2-byte character set */
1904 if (! (coding->flags & CODING_FLAG_ISO_DESIGNATION))
1905 goto label_invalid_code;
1906 ONE_MORE_BYTE (c1);
1907 if (c1 >= '@' && c1 <= 'B')
1908 { /* designation of JISX0208.1978, GB2312.1980,
1909 or JISX0208.1980 */
1910 DECODE_DESIGNATION (0, 2, 94, c1);
1912 else if (c1 >= 0x28 && c1 <= 0x2B)
1913 { /* designation of DIMENSION2_CHARS94 character set */
1914 ONE_MORE_BYTE (c2);
1915 DECODE_DESIGNATION (c1 - 0x28, 2, 94, c2);
1917 else if (c1 >= 0x2C && c1 <= 0x2F)
1918 { /* designation of DIMENSION2_CHARS96 character set */
1919 ONE_MORE_BYTE (c2);
1920 DECODE_DESIGNATION (c1 - 0x2C, 2, 96, c2);
1922 else
1923 goto label_invalid_code;
1924 /* We must update these variables now. */
1925 charset0 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 0);
1926 charset1 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 1);
1927 continue;
1929 case 'n': /* invocation of locking-shift-2 */
1930 if (! (coding->flags & CODING_FLAG_ISO_LOCKING_SHIFT)
1931 || CODING_SPEC_ISO_DESIGNATION (coding, 2) < 0)
1932 goto label_invalid_code;
1933 CODING_SPEC_ISO_INVOCATION (coding, 0) = 2;
1934 charset0 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 0);
1935 continue;
1937 case 'o': /* invocation of locking-shift-3 */
1938 if (! (coding->flags & CODING_FLAG_ISO_LOCKING_SHIFT)
1939 || CODING_SPEC_ISO_DESIGNATION (coding, 3) < 0)
1940 goto label_invalid_code;
1941 CODING_SPEC_ISO_INVOCATION (coding, 0) = 3;
1942 charset0 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 0);
1943 continue;
1945 case 'N': /* invocation of single-shift-2 */
1946 if (! (coding->flags & CODING_FLAG_ISO_SINGLE_SHIFT)
1947 || CODING_SPEC_ISO_DESIGNATION (coding, 2) < 0)
1948 goto label_invalid_code;
1949 charset = CODING_SPEC_ISO_DESIGNATION (coding, 2);
1950 ONE_MORE_BYTE (c1);
1951 if (c1 < 0x20 || (c1 >= 0x80 && c1 < 0xA0))
1952 goto label_invalid_code;
1953 break;
1955 case 'O': /* invocation of single-shift-3 */
1956 if (! (coding->flags & CODING_FLAG_ISO_SINGLE_SHIFT)
1957 || CODING_SPEC_ISO_DESIGNATION (coding, 3) < 0)
1958 goto label_invalid_code;
1959 charset = CODING_SPEC_ISO_DESIGNATION (coding, 3);
1960 ONE_MORE_BYTE (c1);
1961 if (c1 < 0x20 || (c1 >= 0x80 && c1 < 0xA0))
1962 goto label_invalid_code;
1963 break;
1965 case '0': case '2': case '3': case '4': /* start composition */
1966 DECODE_COMPOSITION_START (c1);
1967 continue;
1969 case '1': /* end composition */
1970 DECODE_COMPOSITION_END (c1);
1971 continue;
1973 case '[': /* specification of direction */
1974 if (coding->flags & CODING_FLAG_ISO_NO_DIRECTION)
1975 goto label_invalid_code;
1976 /* For the moment, nested direction is not supported.
1977 So, `coding->mode & CODING_MODE_DIRECTION' zero means
1978 left-to-right, and nonzero means right-to-left. */
1979 ONE_MORE_BYTE (c1);
1980 switch (c1)
1982 case ']': /* end of the current direction */
1983 coding->mode &= ~CODING_MODE_DIRECTION;
1985 case '0': /* end of the current direction */
1986 case '1': /* start of left-to-right direction */
1987 ONE_MORE_BYTE (c1);
1988 if (c1 == ']')
1989 coding->mode &= ~CODING_MODE_DIRECTION;
1990 else
1991 goto label_invalid_code;
1992 break;
1994 case '2': /* start of right-to-left direction */
1995 ONE_MORE_BYTE (c1);
1996 if (c1 == ']')
1997 coding->mode |= CODING_MODE_DIRECTION;
1998 else
1999 goto label_invalid_code;
2000 break;
2002 default:
2003 goto label_invalid_code;
2005 continue;
2007 default:
2008 if (! (coding->flags & CODING_FLAG_ISO_DESIGNATION))
2009 goto label_invalid_code;
2010 if (c1 >= 0x28 && c1 <= 0x2B)
2011 { /* designation of DIMENSION1_CHARS94 character set */
2012 ONE_MORE_BYTE (c2);
2013 DECODE_DESIGNATION (c1 - 0x28, 1, 94, c2);
2015 else if (c1 >= 0x2C && c1 <= 0x2F)
2016 { /* designation of DIMENSION1_CHARS96 character set */
2017 ONE_MORE_BYTE (c2);
2018 DECODE_DESIGNATION (c1 - 0x2C, 1, 96, c2);
2020 else
2021 goto label_invalid_code;
2022 /* We must update these variables now. */
2023 charset0 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 0);
2024 charset1 = CODING_SPEC_ISO_PLANE_CHARSET (coding, 1);
2025 continue;
2029 /* Now we know CHARSET and 1st position code C1 of a character.
2030 Produce a multibyte sequence for that character while getting
2031 2nd position code C2 if necessary. */
2032 if (CHARSET_DIMENSION (charset) == 2)
2034 ONE_MORE_BYTE (c2);
2035 if (c1 < 0x80 ? c2 < 0x20 || c2 >= 0x80 : c2 < 0xA0)
2036 /* C2 is not in a valid range. */
2037 goto label_invalid_code;
2039 c = DECODE_ISO_CHARACTER (charset, c1, c2);
2040 EMIT_CHAR (c);
2041 continue;
2043 label_invalid_code:
2044 coding->errors++;
2045 if (COMPOSING_P (coding))
2046 DECODE_COMPOSITION_END ('1');
2047 src = src_base;
2048 c = *src++;
2049 EMIT_CHAR (c);
2052 label_end_of_loop:
2053 coding->consumed = coding->consumed_char = src_base - source;
2054 coding->produced = dst - destination;
2055 return;
2059 /* ISO2022 encoding stuff. */
2062 /* It is not enough to say just "ISO2022" when encoding; we have to
2063 specify more details. In Emacs, each ISO2022 coding system
2064 variant has the following specifications:
2065 1. Initial designation to G0 through G3.
2066 2. Allows short-form designation?
2067 3. ASCII should be designated to G0 before control characters?
2068 4. ASCII should be designated to G0 at end of line?
2069 5. 7-bit environment or 8-bit environment?
2070 6. Use locking-shift?
2071 7. Use single-shift?
2072 And the following two are only for Japanese:
2073 8. Use JISX0201-1976-Roman in place of ASCII?
2074 9. Use JISX0208-1978 in place of JISX0208-1983?
2075 These specifications are encoded in `coding->flags' as flag bits
2076 defined by macros CODING_FLAG_ISO_XXX. See `coding.h' for more
2077 details. */
2080 /* Produce codes (escape sequence) for designating CHARSET to graphic
2081 register REG at DST, and increment DST. If <final-char> of CHARSET is
2082 '@', 'A', or 'B' and the coding system CODING allows, produce
2083 designation sequence of short-form. */
2085 #define ENCODE_DESIGNATION(charset, reg, coding) \
2086 do { \
2087 unsigned char final_char = CHARSET_ISO_FINAL_CHAR (charset); \
2088 char *intermediate_char_94 = "()*+"; \
2089 char *intermediate_char_96 = ",-./"; \
2090 int revision = CODING_SPEC_ISO_REVISION_NUMBER(coding, charset); \
2092 if (revision < 255) \
2094 *dst++ = ISO_CODE_ESC; \
2095 *dst++ = '&'; \
2096 *dst++ = '@' + revision; \
2098 *dst++ = ISO_CODE_ESC; \
2099 if (CHARSET_DIMENSION (charset) == 1) \
2101 if (CHARSET_CHARS (charset) == 94) \
2102 *dst++ = (unsigned char) (intermediate_char_94[reg]); \
2103 else \
2104 *dst++ = (unsigned char) (intermediate_char_96[reg]); \
2106 else \
2108 *dst++ = '$'; \
2109 if (CHARSET_CHARS (charset) == 94) \
2111 if (! (coding->flags & CODING_FLAG_ISO_SHORT_FORM) \
2112 || reg != 0 \
2113 || final_char < '@' || final_char > 'B') \
2114 *dst++ = (unsigned char) (intermediate_char_94[reg]); \
2116 else \
2117 *dst++ = (unsigned char) (intermediate_char_96[reg]); \
2119 *dst++ = final_char; \
2120 CODING_SPEC_ISO_DESIGNATION (coding, reg) = charset; \
2121 } while (0)
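/* The byte sequences that ENCODE_DESIGNATION emits can be sketched as
   a stand-alone function. This is only an illustration under stated
   assumptions: the name `sketch_designate' and its parameters are
   hypothetical and not part of coding.c, the revision-number prefix
   (ESC & ...) is left out, and the charset properties are passed in
   directly instead of being looked up with CHARSET_XXX macros. */

```c
#include <stddef.h>

/* Hypothetical helper, not part of coding.c.  Write the ISO 2022
   designation sequence for a charset of DIMENSION (1 or 2) with CHARS
   (94 or 96) characters per dimension and final byte FINAL to graphic
   register REG (0..3).  If SHORT_FORM is nonzero, a 94x94 charset
   whose final byte is '@', 'A' or 'B' is designated to G0 with the
   short form ESC $ F.  Return the number of bytes written to DST,
   which is at most 4.  */
static size_t
sketch_designate (unsigned char *dst, int dimension, int chars,
                  int reg, unsigned char final, int short_form)
{
  static const char intermediate_94[] = "()*+";
  static const char intermediate_96[] = ",-./";
  unsigned char *p = dst;

  *p++ = 0x1B;                  /* ESC */
  if (dimension == 1)
    *p++ = (unsigned char) (chars == 94
                            ? intermediate_94[reg]
                            : intermediate_96[reg]);
  else
    {
      *p++ = '$';
      if (chars == 94)
        {
          /* The intermediate byte is dropped only in the short form.  */
          if (!short_form || reg != 0 || final < '@' || final > 'B')
            *p++ = (unsigned char) intermediate_94[reg];
        }
      else
        *p++ = (unsigned char) intermediate_96[reg];
    }
  *p++ = final;
  return (size_t) (p - dst);
}
```

/* For example, a 94x94 charset with final byte 'B' designated to G0
   gives ESC $ B when the short form is allowed and ESC $ ( B
   otherwise. */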
2123 /* The following two macros produce codes (control character or escape
2124 sequence) for ISO2022 single-shift functions (single-shift-2 and
2125 single-shift-3). */
2127 #define ENCODE_SINGLE_SHIFT_2 \
2128 do { \
2129 if (coding->flags & CODING_FLAG_ISO_SEVEN_BITS) \
2130 *dst++ = ISO_CODE_ESC, *dst++ = 'N'; \
2131 else \
2132 *dst++ = ISO_CODE_SS2; \
2133 CODING_SPEC_ISO_SINGLE_SHIFTING (coding) = 1; \
2134 } while (0)
2136 #define ENCODE_SINGLE_SHIFT_3 \
2137 do { \
2138 if (coding->flags & CODING_FLAG_ISO_SEVEN_BITS) \
2139 *dst++ = ISO_CODE_ESC, *dst++ = 'O'; \
2140 else \
2141 *dst++ = ISO_CODE_SS3; \
2142 CODING_SPEC_ISO_SINGLE_SHIFTING (coding) = 1; \
2143 } while (0)
2145 /* The following four macros produce codes (control character or
2146 escape sequence) for ISO2022 locking-shift functions (shift-in,
2147 shift-out, locking-shift-2, and locking-shift-3). */
2149 #define ENCODE_SHIFT_IN \
2150 do { \
2151 *dst++ = ISO_CODE_SI; \
2152 CODING_SPEC_ISO_INVOCATION (coding, 0) = 0; \
2153 } while (0)
2155 #define ENCODE_SHIFT_OUT \
2156 do { \
2157 *dst++ = ISO_CODE_SO; \
2158 CODING_SPEC_ISO_INVOCATION (coding, 0) = 1; \
2159 } while (0)
2161 #define ENCODE_LOCKING_SHIFT_2 \
2162 do { \
2163 *dst++ = ISO_CODE_ESC, *dst++ = 'n'; \
2164 CODING_SPEC_ISO_INVOCATION (coding, 0) = 2; \
2165 } while (0)
2167 #define ENCODE_LOCKING_SHIFT_3 \
2168 do { \
2169 *dst++ = ISO_CODE_ESC, *dst++ = 'o'; \
2170 CODING_SPEC_ISO_INVOCATION (coding, 0) = 3; \
2171 } while (0)
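/* The single-shift and locking-shift macros above differ only in the
   bytes they emit. A minimal sketch of that mapping follows. The
   names `sketch_shift' and `sketch_shift_bytes' are hypothetical, and
   the control codes are the standard ISO 2022 values, which are
   assumed here to match the ISO_CODE_XXX constants in coding.h. The
   sketch does not update any invocation state. */

```c
#include <stddef.h>

/* Hypothetical names for the six shift functions.  */
enum sketch_shift
{
  SKETCH_SI, SKETCH_SO, SKETCH_LS2, SKETCH_LS3, SKETCH_SS2, SKETCH_SS3
};

/* Write the bytes for shift function KIND to DST and return how many
   bytes were written (1 or 2).  SEVEN_BITS nonzero selects the
   escape-sequence form of the single shifts.  */
static size_t
sketch_shift_bytes (unsigned char *dst, enum sketch_shift kind,
                    int seven_bits)
{
  unsigned char *p = dst;

  switch (kind)
    {
    case SKETCH_SI:             /* invoke G0 to plane 0 */
      *p++ = 0x0F;
      break;
    case SKETCH_SO:             /* invoke G1 to plane 0 */
      *p++ = 0x0E;
      break;
    case SKETCH_LS2:            /* invoke G2 to plane 0 */
      *p++ = 0x1B, *p++ = 'n';
      break;
    case SKETCH_LS3:            /* invoke G3 to plane 0 */
      *p++ = 0x1B, *p++ = 'o';
      break;
    case SKETCH_SS2:            /* G2 for the next character only */
      if (seven_bits)
        *p++ = 0x1B, *p++ = 'N';
      else
        *p++ = 0x8E;
      break;
    case SKETCH_SS3:            /* G3 for the next character only */
      if (seven_bits)
        *p++ = 0x1B, *p++ = 'O';
      else
        *p++ = 0x8F;
      break;
    }
  return (size_t) (p - dst);
}
```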
2173 /* Produce codes for a DIMENSION1 character whose character set is
2174 CHARSET and whose position-code is C1. Designation and invocation
2175 sequences are also produced in advance if necessary. */
2177 #define ENCODE_ISO_CHARACTER_DIMENSION1(charset, c1) \
2178 do { \
2179 if (CODING_SPEC_ISO_SINGLE_SHIFTING (coding)) \
2181 if (coding->flags & CODING_FLAG_ISO_SEVEN_BITS) \
2182 *dst++ = c1 & 0x7F; \
2183 else \
2184 *dst++ = c1 | 0x80; \
2185 CODING_SPEC_ISO_SINGLE_SHIFTING (coding) = 0; \
2186 break; \
2188 else if (charset == CODING_SPEC_ISO_PLANE_CHARSET (coding, 0)) \
2190 *dst++ = c1 & 0x7F; \
2191 break; \
2193 else if (charset == CODING_SPEC_ISO_PLANE_CHARSET (coding, 1)) \
2195 *dst++ = c1 | 0x80; \
2196 break; \
2198 else \
2199 /* Since CHARSET is not yet invoked to any graphic planes, we \
2200 must invoke it, or, at first, designate it to some graphic \
2201 register. Then repeat the loop to actually produce the \
2202 character. */ \
2203 dst = encode_invocation_designation (charset, coding, dst); \
2204 } while (1)
2206 /* Produce codes for a DIMENSION2 character whose character set is
2207 CHARSET and whose position-codes are C1 and C2. Designation and
2208 invocation codes are also produced in advance if necessary. */
2210 #define ENCODE_ISO_CHARACTER_DIMENSION2(charset, c1, c2) \
2211 do { \
2212 if (CODING_SPEC_ISO_SINGLE_SHIFTING (coding)) \
2214 if (coding->flags & CODING_FLAG_ISO_SEVEN_BITS) \
2215 *dst++ = c1 & 0x7F, *dst++ = c2 & 0x7F; \
2216 else \
2217 *dst++ = c1 | 0x80, *dst++ = c2 | 0x80; \
2218 CODING_SPEC_ISO_SINGLE_SHIFTING (coding) = 0; \
2219 break; \
2221 else if (charset == CODING_SPEC_ISO_PLANE_CHARSET (coding, 0)) \
2223 *dst++ = c1 & 0x7F, *dst++= c2 & 0x7F; \
2224 break; \
2226 else if (charset == CODING_SPEC_ISO_PLANE_CHARSET (coding, 1)) \
2228 *dst++ = c1 | 0x80, *dst++= c2 | 0x80; \
2229 break; \
2231 else \
2232 /* Since CHARSET is not yet invoked to any graphic planes, we \
2233 must invoke it, or, at first, designate it to some graphic \
2234 register. Then repeat the loop to actually produce the \
2235 character. */ \
2236 dst = encode_invocation_designation (charset, coding, dst); \
2237 } while (1)
2239 #define ENCODE_ISO_CHARACTER(c) \
2240 do { \
2241 int charset, c1, c2; \
2243 SPLIT_CHAR (c, charset, c1, c2); \
2244 if (CHARSET_DEFINED_P (charset)) \
2246 if (CHARSET_DIMENSION (charset) == 1) \
2248 if (charset == CHARSET_ASCII \
2249 && coding->flags & CODING_FLAG_ISO_USE_ROMAN) \
2250 charset = charset_latin_jisx0201; \
2251 ENCODE_ISO_CHARACTER_DIMENSION1 (charset, c1); \
2253 else \
2255 if (charset == charset_jisx0208 \
2256 && coding->flags & CODING_FLAG_ISO_USE_OLDJIS) \
2257 charset = charset_jisx0208_1978; \
2258 ENCODE_ISO_CHARACTER_DIMENSION2 (charset, c1, c2); \
2261 else \
2263 *dst++ = c1; \
2264 if (c2 >= 0) \
2265 *dst++ = c2; \
2267 } while (0)
2270 /* Instead of encoding character C, produce one or two `?'s. */
2272 #define ENCODE_UNSAFE_CHARACTER(c) \
2273 do { \
2274 ENCODE_ISO_CHARACTER (CODING_INHIBIT_CHARACTER_SUBSTITUTION); \
2275 if (CHARSET_WIDTH (CHAR_CHARSET (c)) > 1) \
2276 ENCODE_ISO_CHARACTER (CODING_INHIBIT_CHARACTER_SUBSTITUTION); \
2277 } while (0)
2280 /* Produce designation and invocation codes at a place pointed by DST
2281 to use CHARSET. The element `spec.iso2022' of *CODING is updated.
2282 Return new DST. */
2284 unsigned char *
2285 encode_invocation_designation (charset, coding, dst)
2286 int charset;
2287 struct coding_system *coding;
2288 unsigned char *dst;
2290 int reg; /* graphic register number */
2292 /* At first, check designations. */
2293 for (reg = 0; reg < 4; reg++)
2294 if (charset == CODING_SPEC_ISO_DESIGNATION (coding, reg))
2295 break;
2297 if (reg >= 4)
2299 /* CHARSET is not yet designated to any graphic registers. */
2300 /* At first check the requested designation. */
2301 reg = CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset);
2302 if (reg == CODING_SPEC_ISO_NO_REQUESTED_DESIGNATION)
2303 /* Since CHARSET requests no special designation, designate it
2304 to graphic register 0. */
2305 reg = 0;
2307 ENCODE_DESIGNATION (charset, reg, coding);
2310 if (CODING_SPEC_ISO_INVOCATION (coding, 0) != reg
2311 && CODING_SPEC_ISO_INVOCATION (coding, 1) != reg)
2313 /* Since the graphic register REG is not invoked to any graphic
2314 planes, invoke it to graphic plane 0. */
2315 switch (reg)
2317 case 0: /* graphic register 0 */
2318 ENCODE_SHIFT_IN;
2319 break;
2321 case 1: /* graphic register 1 */
2322 ENCODE_SHIFT_OUT;
2323 break;
2325 case 2: /* graphic register 2 */
2326 if (coding->flags & CODING_FLAG_ISO_SINGLE_SHIFT)
2327 ENCODE_SINGLE_SHIFT_2;
2328 else
2329 ENCODE_LOCKING_SHIFT_2;
2330 break;
2332 case 3: /* graphic register 3 */
2333 if (coding->flags & CODING_FLAG_ISO_SINGLE_SHIFT)
2334 ENCODE_SINGLE_SHIFT_3;
2335 else
2336 ENCODE_LOCKING_SHIFT_3;
2337 break;
2341 return dst;
2344 /* Produce 2-byte codes for encoded composition rule RULE. */
2346 #define ENCODE_COMPOSITION_RULE(rule) \
2347 do { \
2348 int gref, nref; \
2349 COMPOSITION_DECODE_RULE (rule, gref, nref); \
2350 *dst++ = 32 + 81 + gref; \
2351 *dst++ = 32 + nref; \
2352 } while (0)
2354 /* Produce codes for indicating the start of a composition sequence
2355 (ESC 0, ESC 3, or ESC 4). DATA points to an array of integers
2356 which specify information about the composition. See the comment
2357 in coding.h for the format of DATA. */
2359 #define ENCODE_COMPOSITION_START(coding, data) \
2360 do { \
2361 coding->composing = data[3]; \
2362 *dst++ = ISO_CODE_ESC; \
2363 if (coding->composing == COMPOSITION_RELATIVE) \
2364 *dst++ = '0'; \
2365 else \
2367 *dst++ = (coding->composing == COMPOSITION_WITH_ALTCHARS \
2368 ? '3' : '4'); \
2369 coding->cmp_data_index = coding->cmp_data_start + 4; \
2370 coding->composition_rule_follows = 0; \
2372 } while (0)
2374 /* Produce codes for indicating the end of the current composition. */
2376 #define ENCODE_COMPOSITION_END(coding, data) \
2377 do { \
2378 *dst++ = ISO_CODE_ESC; \
2379 *dst++ = '1'; \
2380 coding->cmp_data_start += data[0]; \
2381 coding->composing = COMPOSITION_NO; \
2382 if (coding->cmp_data_start == coding->cmp_data->used \
2383 && coding->cmp_data->next) \
2385 coding->cmp_data = coding->cmp_data->next; \
2386 coding->cmp_data_start = 0; \
2388 } while (0)
2390 /* Produce composition start sequence ESC 0. Here, this sequence
2391 doesn't mean the start of a new composition but means that we have
2392 just produced components (alternate chars and composition rules) of
2393 the composition and the actual text follows in SRC. */
2395 #define ENCODE_COMPOSITION_FAKE_START(coding) \
2396 do { \
2397 *dst++ = ISO_CODE_ESC; \
2398 *dst++ = '0'; \
2399 coding->composing = COMPOSITION_RELATIVE; \
2400 } while (0)
2402 /* The following three macros produce codes for indicating direction
2403 of text. */
2404 #define ENCODE_CONTROL_SEQUENCE_INTRODUCER \
2405 do { \
2406 if (coding->flags & CODING_FLAG_ISO_SEVEN_BITS) \
2407 *dst++ = ISO_CODE_ESC, *dst++ = '['; \
2408 else \
2409 *dst++ = ISO_CODE_CSI; \
2410 } while (0)
2412 #define ENCODE_DIRECTION_R2L \
2413 do { ENCODE_CONTROL_SEQUENCE_INTRODUCER; *dst++ = '2'; *dst++ = ']'; } while (0)
2415 #define ENCODE_DIRECTION_L2R \
2416 do { ENCODE_CONTROL_SEQUENCE_INTRODUCER; *dst++ = '0'; *dst++ = ']'; } while (0)
2418 /* Produce codes for designation and invocation to reset the graphic
2419 planes and registers to initial state. */
2420 #define ENCODE_RESET_PLANE_AND_REGISTER \
2421 do { \
2422 int reg; \
2423 if (CODING_SPEC_ISO_INVOCATION (coding, 0) != 0) \
2424 ENCODE_SHIFT_IN; \
2425 for (reg = 0; reg < 4; reg++) \
2426 if (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, reg) >= 0 \
2427 && (CODING_SPEC_ISO_DESIGNATION (coding, reg) \
2428 != CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, reg))) \
2429 ENCODE_DESIGNATION \
2430 (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, reg), reg, coding); \
2431 } while (0)
2433 /* Produce designation sequences of charsets in the line started from
2434 SRC to a place pointed by DST, and return updated DST.
2436 If the current block ends before any end-of-line, we may fail to
2437 find all the necessary designations. */
2439 static unsigned char *
2440 encode_designation_at_bol (coding, translation_table, src, src_end, dst)
2441 struct coding_system *coding;
2442 Lisp_Object translation_table;
2443 unsigned char *src, *src_end, *dst;
2445 int charset, c, found = 0, reg;
2446 /* Table of charsets to be designated to each graphic register. */
2447 int r[4];
2449 for (reg = 0; reg < 4; reg++)
2450 r[reg] = -1;
2452 while (found < 4)
2454 ONE_MORE_CHAR (c);
2455 if (c == '\n')
2456 break;
2458 charset = CHAR_CHARSET (c);
2459 reg = CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset);
2460 if (reg != CODING_SPEC_ISO_NO_REQUESTED_DESIGNATION && r[reg] < 0)
2462 found++;
2463 r[reg] = charset;
2467 label_end_of_loop:
2468 if (found)
2470 for (reg = 0; reg < 4; reg++)
2471 if (r[reg] >= 0
2472 && CODING_SPEC_ISO_DESIGNATION (coding, reg) != r[reg])
2473 ENCODE_DESIGNATION (r[reg], reg, coding);
2476 return dst;
2479 /* See the above "GENERAL NOTES on `encode_coding_XXX ()' functions". */
2481 static void
2482 encode_coding_iso2022 (coding, source, destination, src_bytes, dst_bytes)
2483 struct coding_system *coding;
2484 unsigned char *source, *destination;
2485 int src_bytes, dst_bytes;
2487 unsigned char *src = source;
2488 unsigned char *src_end = source + src_bytes;
2489 unsigned char *dst = destination;
2490 unsigned char *dst_end = destination + dst_bytes;
2491 /* Since each loop iteration produces at most 20 bytes, we subtract 19
2492 from DST_END so that overflow needs to be checked only at the
2493 head of the loop. */
2494 unsigned char *adjusted_dst_end = dst_end - 19;
2495 /* SRC_BASE remembers the start position in source in each loop.
2496 The loop will be exited when there's not enough source text to
2497 analyze multi-byte codes (within macro ONE_MORE_CHAR), or when
2498 there's not enough destination area to produce encoded codes
2499 (within macro EMIT_BYTES). */
2500 unsigned char *src_base;
2501 int c;
2502 Lisp_Object translation_table;
2503 Lisp_Object safe_chars;
2505 safe_chars = coding_safe_chars (coding->symbol);
2507 if (NILP (Venable_character_translation))
2508 translation_table = Qnil;
2509 else
2511 translation_table = coding->translation_table_for_encode;
2512 if (NILP (translation_table))
2513 translation_table = Vstandard_translation_table_for_encode;
2516 coding->consumed_char = 0;
2517 coding->errors = 0;
2518 while (1)
2520 src_base = src;
2522 if (dst >= (dst_bytes ? adjusted_dst_end : (src - 19)))
2524 coding->result = CODING_FINISH_INSUFFICIENT_DST;
2525 break;
2528 if (coding->flags & CODING_FLAG_ISO_DESIGNATE_AT_BOL
2529 && CODING_SPEC_ISO_BOL (coding))
2531 /* We have to produce designation sequences if any now. */
2532 dst = encode_designation_at_bol (coding, translation_table,
2533 src, src_end, dst);
2534 CODING_SPEC_ISO_BOL (coding) = 0;
2537 /* Check composition start and end. */
2538 if (coding->composing != COMPOSITION_DISABLED
2539 && coding->cmp_data_start < coding->cmp_data->used)
2541 struct composition_data *cmp_data = coding->cmp_data;
2542 int *data = cmp_data->data + coding->cmp_data_start;
2543 int this_pos = cmp_data->char_offset + coding->consumed_char;
2545 if (coding->composing == COMPOSITION_RELATIVE)
2547 if (this_pos == data[2])
2549 ENCODE_COMPOSITION_END (coding, data);
2550 cmp_data = coding->cmp_data;
2551 data = cmp_data->data + coding->cmp_data_start;
2554 else if (COMPOSING_P (coding))
2556 /* COMPOSITION_WITH_ALTCHARS or COMPOSITION_WITH_RULE_ALTCHARS */
2557 if (coding->cmp_data_index == coding->cmp_data_start + data[0])
2558 /* We have consumed components of the composition.
2559 What follows in SRC is the composition's base
2560 text. */
2561 ENCODE_COMPOSITION_FAKE_START (coding);
2562 else
2564 int c = cmp_data->data[coding->cmp_data_index++];
2565 if (coding->composition_rule_follows)
2567 ENCODE_COMPOSITION_RULE (c);
2568 coding->composition_rule_follows = 0;
2570 else
2572 if (coding->flags & CODING_FLAG_ISO_SAFE
2573 && ! CODING_SAFE_CHAR_P (safe_chars, c))
2574 ENCODE_UNSAFE_CHARACTER (c);
2575 else
2576 ENCODE_ISO_CHARACTER (c);
2577 if (coding->composing == COMPOSITION_WITH_RULE_ALTCHARS)
2578 coding->composition_rule_follows = 1;
2580 continue;
2583 if (!COMPOSING_P (coding))
2585 if (this_pos == data[1])
2587 ENCODE_COMPOSITION_START (coding, data);
2588 continue;
2593 ONE_MORE_CHAR (c);
2595 /* Now encode the character C. */
2596 if (c < 0x20 || c == 0x7F)
2598 if (c == '\r')
2600 if (! (coding->mode & CODING_MODE_SELECTIVE_DISPLAY))
2602 if (coding->flags & CODING_FLAG_ISO_RESET_AT_CNTL)
2603 ENCODE_RESET_PLANE_AND_REGISTER;
2604 *dst++ = c;
2605 continue;
2607 /* fall through to treat '\r' as '\n' ... */
2608 c = '\n';
2610 if (c == '\n')
2612 if (coding->flags & CODING_FLAG_ISO_RESET_AT_EOL)
2613 ENCODE_RESET_PLANE_AND_REGISTER;
2614 if (coding->flags & CODING_FLAG_ISO_INIT_AT_BOL)
2615 bcopy (coding->spec.iso2022.initial_designation,
2616 coding->spec.iso2022.current_designation,
2617 sizeof coding->spec.iso2022.initial_designation);
2618 if (coding->eol_type == CODING_EOL_LF
2619 || coding->eol_type == CODING_EOL_UNDECIDED)
2620 *dst++ = ISO_CODE_LF;
2621 else if (coding->eol_type == CODING_EOL_CRLF)
2622 *dst++ = ISO_CODE_CR, *dst++ = ISO_CODE_LF;
2623 else
2624 *dst++ = ISO_CODE_CR;
2625 CODING_SPEC_ISO_BOL (coding) = 1;
2627 else
2629 if (coding->flags & CODING_FLAG_ISO_RESET_AT_CNTL)
2630 ENCODE_RESET_PLANE_AND_REGISTER;
2631 *dst++ = c;
2634 else if (ASCII_BYTE_P (c))
2635 ENCODE_ISO_CHARACTER (c);
2636 else if (SINGLE_BYTE_CHAR_P (c))
2638 *dst++ = c;
2639 coding->errors++;
2641 else if (coding->flags & CODING_FLAG_ISO_SAFE
2642 && ! CODING_SAFE_CHAR_P (safe_chars, c))
2643 ENCODE_UNSAFE_CHARACTER (c);
2644 else
2645 ENCODE_ISO_CHARACTER (c);
2647 coding->consumed_char++;
2650 label_end_of_loop:
2651 coding->consumed = src_base - source;
2652 coding->produced = coding->produced_char = dst - destination;
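/* The end-of-line handling in the loop above reduces to a small
   mapping from the coding system's EOL type to the bytes emitted for
   each newline. A minimal sketch follows; the enum `sketch_eol' is a
   hypothetical stand-in for the CODING_EOL_XXX constants, whose real
   values are defined in coding.h and may differ. */

```c
#include <stddef.h>

/* Hypothetical stand-ins for CODING_EOL_UNDECIDED, CODING_EOL_LF,
   CODING_EOL_CRLF and CODING_EOL_CR.  */
enum sketch_eol
{
  SKETCH_EOL_UNDECIDED, SKETCH_EOL_LF, SKETCH_EOL_CRLF, SKETCH_EOL_CR
};

/* Write the external representation of one newline to DST and return
   the number of bytes written (1 or 2).  As in the encoder above, an
   undecided EOL type is treated like LF.  */
static size_t
sketch_emit_eol (unsigned char *dst, enum sketch_eol eol_type)
{
  unsigned char *p = dst;

  if (eol_type == SKETCH_EOL_LF || eol_type == SKETCH_EOL_UNDECIDED)
    *p++ = '\n';
  else if (eol_type == SKETCH_EOL_CRLF)
    *p++ = '\r', *p++ = '\n';
  else
    *p++ = '\r';
  return (size_t) (p - dst);
}
```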
2656 /*** 4. SJIS and BIG5 handlers ***/
2658 /* Although SJIS and BIG5 are not ISO coding systems, they are used
2659 quite widely. So, for the moment, Emacs supports them in the bare
2660 C code. But, in the future, they may be supported only by CCL. */
2662 /* SJIS is a coding system encoding three character sets: ASCII, the
2663 right half of JISX0201-Kana, and JISX0208. An ASCII character is
2664 encoded as is. A character of charset katakana-jisx0201 is encoded
2665 by "position-code + 0x80". A character of charset japanese-jisx0208
2666 is encoded in two bytes, but the two position-codes are divided and
2667 shifted so that they fit in the ranges below.
2669 --- CODE RANGE of SJIS ---
2670 (character set)      (range)
2671 ASCII                0x00 .. 0x7F
2672 KATAKANA-JISX0201    0xA1 .. 0xDF
2673 JISX0208 (1st byte)  0x81 .. 0x9F and 0xE0 .. 0xEF
2674          (2nd byte)  0x40 .. 0x7E and 0x80 .. 0xFC
2675 ------------------------------- */
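/* How two JISX0208 position-codes are "divided and shifted" into the
   SJIS ranges can be sketched as follows. This is the conventional
   JIS <-> Shift-JIS arithmetic written as plain functions; it is not
   copied from the DECODE_SJIS and ENCODE_SJIS macros, which are
   defined elsewhere, and the function names are hypothetical. Both
   position-codes are assumed to be in the range 0x21 .. 0x7E. */

```c
/* Hypothetical helper: convert JISX0208 position-codes J1, J2
   (0x21 .. 0x7E each) to the two SJIS bytes *S1 and *S2.  */
static void
sketch_jis_to_sjis (int j1, int j2, int *s1, int *s2)
{
  /* Two JIS rows share one SJIS first byte; 0xA0 .. 0xDF is skipped.  */
  *s1 = ((j1 - 0x21) >> 1) + (j1 <= 0x5E ? 0x81 : 0xC1);
  if (j1 & 1)
    /* Odd row: 0x40 .. 0x7E and 0x80 .. 0x9E (0x7F is skipped).  */
    *s2 = j2 + (j2 <= 0x5F ? 0x1F : 0x20);
  else
    /* Even row: 0x9F .. 0xFC.  */
    *s2 = j2 + 0x7E;
}

/* Hypothetical helper: the inverse of sketch_jis_to_sjis.  S1 and S2
   must be valid SJIS first and second bytes.  */
static void
sketch_sjis_to_jis (int s1, int s2, int *j1, int *j2)
{
  int row = ((s1 >= 0xE0 ? s1 - 0x40 : s1) - 0x81) * 2;

  if (s2 >= 0x9F)
    {
      *j1 = row + 0x22;
      *j2 = s2 - 0x7E;
    }
  else
    {
      *j1 = row + 0x21;
      *j2 = s2 - (s2 >= 0x80 ? 0x20 : 0x1F);
    }
}
```

/* With these definitions, JIS 0x21 0x21 maps to SJIS 0x81 0x40, and
   every valid pair of position-codes survives a round trip. */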
2679 /* BIG5 is a coding system encoding two character sets: ASCII and
2680 Big5. An ASCII character is encoded as is. Big5 is a two-byte
2681 character set and is encoded in two bytes.
2683 --- CODE RANGE of BIG5 ---
2684 (character set)   (range)
2685 ASCII             0x00 .. 0x7F
2686 Big5 (1st byte)   0xA1 .. 0xFE
2687      (2nd byte)   0x40 .. 0x7E and 0xA1 .. 0xFE
2688 --------------------------
2690 Since the number of characters in Big5 is larger than the maximum
2691 number of characters in an Emacs charset (96x96), it can't be
2692 handled as one charset. So, in Emacs, Big5 is divided into two:
2693 `charset-big5-1' and `charset-big5-2'. Both are DIMENSION2 and
2694 CHARS94. The former contains frequently used characters and the
2695 latter contains less frequently used characters. */
2697 /* Macros to decode or encode a character of Big5 in BIG5. B1 and B2
2698 are the 1st and 2nd position-codes of Big5 in BIG5 coding system.
2699 C1 and C2 are the 1st and 2nd position-codes of Emacs' internal
2700 format. CHARSET is `charset_big5_1' or `charset_big5_2'. */
2702 /* Number of Big5 characters which have the same code in 1st byte. */
2703 #define BIG5_SAME_ROW (0xFF - 0xA1 + 0x7F - 0x40)
2705 #define DECODE_BIG5(b1, b2, charset, c1, c2) \
2706 do { \
2707 unsigned int temp \
2708 = (b1 - 0xA1) * BIG5_SAME_ROW + b2 - (b2 < 0x7F ? 0x40 : 0x62); \
2709 if (b1 < 0xC9) \
2710 charset = charset_big5_1; \
2711 else \
2713 charset = charset_big5_2; \
2714 temp -= (0xC9 - 0xA1) * BIG5_SAME_ROW; \
2716 c1 = temp / (0xFF - 0xA1) + 0x21; \
2717 c2 = temp % (0xFF - 0xA1) + 0x21; \
2718 } while (0)
2720 #define ENCODE_BIG5(charset, c1, c2, b1, b2) \
2721 do { \
2722 unsigned int temp = (c1 - 0x21) * (0xFF - 0xA1) + (c2 - 0x21); \
2723 if (charset == charset_big5_2) \
2724 temp += BIG5_SAME_ROW * (0xC9 - 0xA1); \
2725 b1 = temp / BIG5_SAME_ROW + 0xA1; \
2726 b2 = temp % BIG5_SAME_ROW; \
2727 b2 += b2 < 0x3F ? 0x40 : 0x62; \
2728 } while (0)
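/* The arithmetic of DECODE_BIG5 and ENCODE_BIG5 may be easier to
   follow as plain functions. The sketch below mirrors the two macros
   above, with one assumption: the charset is represented by a
   hypothetical integer SET (1 for `charset_big5_1', 2 for
   `charset_big5_2') instead of the real charset IDs. The function
   names are also hypothetical. */

```c
/* Number of Big5 characters sharing one first byte: 94 second bytes
   in 0xA1 .. 0xFE plus 63 in 0x40 .. 0x7E, i.e. 157.  */
enum { SKETCH_BIG5_SAME_ROW = 0xFF - 0xA1 + 0x7F - 0x40 };

/* Hypothetical helper: map Big5 bytes B1, B2 to a set number (1 or 2)
   and two position-codes in 0x21 .. 0x7E.  */
static void
sketch_decode_big5 (int b1, int b2, int *set, int *c1, int *c2)
{
  /* Linear index of the character within the whole Big5 table.  */
  int temp = ((b1 - 0xA1) * SKETCH_BIG5_SAME_ROW
              + b2 - (b2 < 0x7F ? 0x40 : 0x62));

  if (b1 < 0xC9)
    *set = 1;
  else
    {
      *set = 2;
      temp -= (0xC9 - 0xA1) * SKETCH_BIG5_SAME_ROW;
    }
  /* Re-split the index into rows of 94 characters.  */
  *c1 = temp / (0xFF - 0xA1) + 0x21;
  *c2 = temp % (0xFF - 0xA1) + 0x21;
}

/* Hypothetical helper: the inverse of sketch_decode_big5.  */
static void
sketch_encode_big5 (int set, int c1, int c2, int *b1, int *b2)
{
  int temp = (c1 - 0x21) * (0xFF - 0xA1) + (c2 - 0x21);

  if (set == 2)
    temp += SKETCH_BIG5_SAME_ROW * (0xC9 - 0xA1);
  *b1 = temp / SKETCH_BIG5_SAME_ROW + 0xA1;
  *b2 = temp % SKETCH_BIG5_SAME_ROW;
  *b2 += *b2 < 0x3F ? 0x40 : 0x62;
}
```

/* For instance, the Big5 bytes 0xA4 0x40 have linear index 471, which
   gives set 1 with position-codes 0x26 and 0x22. */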
2730 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
2731 Check if a text is encoded in SJIS. If it is, return
2732 CODING_CATEGORY_MASK_SJIS, else return 0. */
2734 static int
2735 detect_coding_sjis (src, src_end, multibytep)
2736 unsigned char *src, *src_end;
2737 int multibytep;
2739 int c;
2740 /* Dummy for ONE_MORE_BYTE. */
2741 struct coding_system dummy_coding;
2742 struct coding_system *coding = &dummy_coding;
2744 while (1)
2746 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
2747 if (c < 0x80)
2748 continue;
2749 if (c == 0x80 || c == 0xA0 || c > 0xEF)
2750 return 0;
2751 if (c <= 0x9F || c >= 0xE0)
2753 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
2754 if (c < 0x40 || c == 0x7F || c > 0xFC)
2755 return 0;
2758 label_end_of_loop:
2759 return CODING_CATEGORY_MASK_SJIS;
2762 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
2763 Check if a text is encoded in BIG5. If it is, return
2764 CODING_CATEGORY_MASK_BIG5, else return 0. */
2766 static int
2767 detect_coding_big5 (src, src_end, multibytep)
2768 unsigned char *src, *src_end;
2769 int multibytep;
2771 int c;
2772 /* Dummy for ONE_MORE_BYTE. */
2773 struct coding_system dummy_coding;
2774 struct coding_system *coding = &dummy_coding;
2776 while (1)
2778 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
2779 if (c < 0x80)
2780 continue;
2781 if (c < 0xA1 || c > 0xFE)
2782 return 0;
2783 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
2784 if (c < 0x40 || (c > 0x7E && c < 0xA1) || c > 0xFE)
2785 return 0;
2787 label_end_of_loop:
2788 return CODING_CATEGORY_MASK_BIG5;
2791 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
2792 Check if a text is encoded in UTF-8. If it is, return
2793 CODING_CATEGORY_MASK_UTF_8, else return 0. */
2795 #define UTF_8_1_OCTET_P(c) ((c) < 0x80)
2796 #define UTF_8_EXTRA_OCTET_P(c) (((c) & 0xC0) == 0x80)
2797 #define UTF_8_2_OCTET_LEADING_P(c) (((c) & 0xE0) == 0xC0)
2798 #define UTF_8_3_OCTET_LEADING_P(c) (((c) & 0xF0) == 0xE0)
2799 #define UTF_8_4_OCTET_LEADING_P(c) (((c) & 0xF8) == 0xF0)
2800 #define UTF_8_5_OCTET_LEADING_P(c) (((c) & 0xFC) == 0xF8)
2801 #define UTF_8_6_OCTET_LEADING_P(c) (((c) & 0xFE) == 0xFC)
2803 static int
2804 detect_coding_utf_8 (src, src_end, multibytep)
2805 unsigned char *src, *src_end;
2806 int multibytep;
2808 unsigned char c;
2809 int seq_maybe_bytes;
2810 /* Dummy for ONE_MORE_BYTE. */
2811 struct coding_system dummy_coding;
2812 struct coding_system *coding = &dummy_coding;
2814 while (1)
2816 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
2817 if (UTF_8_1_OCTET_P (c))
2818 continue;
2819 else if (UTF_8_2_OCTET_LEADING_P (c))
2820 seq_maybe_bytes = 1;
2821 else if (UTF_8_3_OCTET_LEADING_P (c))
2822 seq_maybe_bytes = 2;
2823 else if (UTF_8_4_OCTET_LEADING_P (c))
2824 seq_maybe_bytes = 3;
2825 else if (UTF_8_5_OCTET_LEADING_P (c))
2826 seq_maybe_bytes = 4;
2827 else if (UTF_8_6_OCTET_LEADING_P (c))
2828 seq_maybe_bytes = 5;
2829 else
2830 return 0;
2834 ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
2835 if (!UTF_8_EXTRA_OCTET_P (c))
2836 return 0;
2837 seq_maybe_bytes--;
2839 while (seq_maybe_bytes > 0);
2842 label_end_of_loop:
2843 return CODING_CATEGORY_MASK_UTF_8;
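/* The detection loop above can be restated without the
   ONE_MORE_BYTE_CHECK_MULTIBYTE macro. The sketch below is only an
   illustration: the function names are hypothetical, it works on a
   plain byte array, and like the detector it checks structure only
   (a leading octet followed by the right number of 10xxxxxx octets).
   It does not reject overlong forms or surrogate code points, and a
   sequence cut off at the end of the input is accepted. */

```c
#include <stddef.h>

/* Hypothetical helper: return the number of continuation octets that
   must follow leading octet C (0 .. 5), or -1 if C cannot start a
   sequence.  */
static int
sketch_utf8_trailing (unsigned char c)
{
  if (c < 0x80)
    return 0;
  if ((c & 0xE0) == 0xC0)
    return 1;
  if ((c & 0xF0) == 0xE0)
    return 2;
  if ((c & 0xF8) == 0xF0)
    return 3;
  if ((c & 0xFC) == 0xF8)
    return 4;
  if ((c & 0xFE) == 0xFC)
    return 5;
  return -1;
}

/* Hypothetical helper: return 1 if the N bytes at S are structurally
   valid UTF-8 (possibly truncated at the end), else 0.  */
static int
sketch_looks_like_utf8 (const unsigned char *s, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      int trailing = sketch_utf8_trailing (s[i++]);

      if (trailing < 0)
        return 0;
      while (trailing-- > 0)
        {
          if (i >= n)
            return 1;           /* ran out of input mid-sequence */
          if ((s[i++] & 0xC0) != 0x80)
            return 0;
        }
    }
  return 1;
}
```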
2846 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
2847 Check if a text starts with a UTF-16 byte order mark. If the first
2848 two bytes are FE FF (Big Endian) or FF FE (Little Endian), return
2849 CODING_CATEGORY_MASK_UTF_16_BE or CODING_CATEGORY_MASK_UTF_16_LE
2850 respectively, else return 0. */
2852 #define UTF_16_INVALID_P(val) \
2853 (((val) == 0xFFFE) \
2854 || ((val) == 0xFFFF))
2856 #define UTF_16_HIGH_SURROGATE_P(val) \
2857 (((val) & 0xFC00) == 0xD800)
2859 #define UTF_16_LOW_SURROGATE_P(val) \
2860 (((val) & 0xFC00) == 0xDC00)
2862 static int
2863 detect_coding_utf_16 (src, src_end, multibytep)
2864 unsigned char *src, *src_end;
2865 int multibytep;
2867 unsigned char c1, c2;
2868 /* Dummy for ONE_MORE_BYTE. */
2869 struct coding_system dummy_coding;
2870 struct coding_system *coding = &dummy_coding;
2872 ONE_MORE_BYTE_CHECK_MULTIBYTE (c1, multibytep);
2873 ONE_MORE_BYTE_CHECK_MULTIBYTE (c2, multibytep);
2875 if ((c1 == 0xFF) && (c2 == 0xFE))
2876 return CODING_CATEGORY_MASK_UTF_16_LE;
2877 else if ((c1 == 0xFE) && (c2 == 0xFF))
2878 return CODING_CATEGORY_MASK_UTF_16_BE;
2880 label_end_of_loop:
2881 return 0;
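/* The byte order mark test above, together with the surrogate ranges
   the predicates are meant to cover, can be sketched as follows. The
   names are hypothetical and the sketch works on a plain byte array
   rather than through ONE_MORE_BYTE_CHECK_MULTIBYTE. High surrogates
   are 0xD800 .. 0xDBFF and low surrogates are 0xDC00 .. 0xDFFF, so
   both tests mask with 0xFC00. */

```c
#include <stddef.h>

/* Hypothetical result type for the byte order mark check.  */
enum sketch_utf16_order
{
  SKETCH_UTF16_UNKNOWN, SKETCH_UTF16_LE, SKETCH_UTF16_BE
};

/* Hypothetical helper: inspect the first two of the N bytes at S and
   report which UTF-16 byte order mark, if any, they form.  */
static enum sketch_utf16_order
sketch_utf16_bom (const unsigned char *s, size_t n)
{
  if (n < 2)
    return SKETCH_UTF16_UNKNOWN;
  if (s[0] == 0xFF && s[1] == 0xFE)
    return SKETCH_UTF16_LE;
  if (s[0] == 0xFE && s[1] == 0xFF)
    return SKETCH_UTF16_BE;
  return SKETCH_UTF16_UNKNOWN;
}

/* Return nonzero if the 16-bit unit VAL is a high surrogate.  */
static int
sketch_high_surrogate_p (unsigned int val)
{
  return (val & 0xFC00) == 0xD800;
}

/* Return nonzero if the 16-bit unit VAL is a low surrogate.  */
static int
sketch_low_surrogate_p (unsigned int val)
{
  return (val & 0xFC00) == 0xDC00;
}
```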
2884 /* See the above "GENERAL NOTES on `decode_coding_XXX ()' functions".
2885 If SJIS_P is 1, decode SJIS text, else decode BIG5 text. */
2887 static void
2888 decode_coding_sjis_big5 (coding, source, destination,
2889 src_bytes, dst_bytes, sjis_p)
2890 struct coding_system *coding;
2891 unsigned char *source, *destination;
2892 int src_bytes, dst_bytes;
2893 int sjis_p;
2895 unsigned char *src = source;
2896 unsigned char *src_end = source + src_bytes;
2897 unsigned char *dst = destination;
2898 unsigned char *dst_end = destination + dst_bytes;
2899 /* SRC_BASE remembers the start position in source in each loop.
2900 The loop will be exited when there's not enough source code
2901 (within macro ONE_MORE_BYTE), or when there's not enough
2902 destination area to produce a character (within macro
2903 EMIT_CHAR). */
2904 unsigned char *src_base;
2905 Lisp_Object translation_table;
2907 if (NILP (Venable_character_translation))
2908 translation_table = Qnil;
2909 else
2911 translation_table = coding->translation_table_for_decode;
2912 if (NILP (translation_table))
2913 translation_table = Vstandard_translation_table_for_decode;
2916 coding->produced_char = 0;
2917 while (1)
2919 int c, charset, c1, c2;
2921 src_base = src;
2922 ONE_MORE_BYTE (c1);
2924 if (c1 < 0x80)
2926 charset = CHARSET_ASCII;
2927 if (c1 < 0x20)
2929 if (c1 == '\r')
2931 if (coding->eol_type == CODING_EOL_CRLF)
2933 ONE_MORE_BYTE (c2);
2934 if (c2 == '\n')
2935 c1 = c2;
2936 else
2937 /* To process C2 again, SRC is subtracted by 1. */
2938 src--;
2940 else if (coding->eol_type == CODING_EOL_CR)
2941 c1 = '\n';
2943 else if (c1 == '\n'
2944 && (coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL)
2945 && (coding->eol_type == CODING_EOL_CR
2946 || coding->eol_type == CODING_EOL_CRLF))
2948 coding->result = CODING_FINISH_INCONSISTENT_EOL;
2949 goto label_end_of_loop;
2953 else
2955 if (sjis_p)
2957 if (c1 == 0x80 || c1 == 0xA0 || c1 > 0xEF)
2958 goto label_invalid_code;
2959 if (c1 <= 0x9F || c1 >= 0xE0)
2961 /* SJIS -> JISX0208 */
2962 ONE_MORE_BYTE (c2);
2963 if (c2 < 0x40 || c2 == 0x7F || c2 > 0xFC)
2964 goto label_invalid_code;
2965 DECODE_SJIS (c1, c2, c1, c2);
2966 charset = charset_jisx0208;
2968 else
2969 /* SJIS -> JISX0201-Kana */
2970 charset = charset_katakana_jisx0201;
2972 else
2974 /* BIG5 -> Big5 */
2975 if (c1 < 0xA1 || c1 > 0xFE)
2976 goto label_invalid_code;
2977 ONE_MORE_BYTE (c2);
2978 if (c2 < 0x40 || (c2 > 0x7E && c2 < 0xA1) || c2 > 0xFE)
2979 goto label_invalid_code;
2980 DECODE_BIG5 (c1, c2, charset, c1, c2);
2984 c = DECODE_ISO_CHARACTER (charset, c1, c2);
2985 EMIT_CHAR (c);
2986 continue;
2988 label_invalid_code:
2989 coding->errors++;
2990 src = src_base;
2991 c = *src++;
2992 EMIT_CHAR (c);
2995 label_end_of_loop:
2996 coding->consumed = coding->consumed_char = src_base - source;
2997 coding->produced = dst - destination;
2998 return;
3001 /* See the above "GENERAL NOTES on `encode_coding_XXX ()' functions".
3002 This function can encode charsets `ascii', `katakana-jisx0201',
3003 `japanese-jisx0208', `chinese-big5-1', and `chinese-big5-2'. We
3004 are sure that all these charsets are registered as official charsets
3005 (i.e. do not have extended leading-codes). Characters of other
3006 charsets are produced without any encoding. If SJIS_P is 1, encode
3007 SJIS text, else encode BIG5 text. */
3009 static void
3010 encode_coding_sjis_big5 (coding, source, destination,
3011 src_bytes, dst_bytes, sjis_p)
3012 struct coding_system *coding;
3013 unsigned char *source, *destination;
3014 int src_bytes, dst_bytes;
3015 int sjis_p;
3017 unsigned char *src = source;
3018 unsigned char *src_end = source + src_bytes;
3019 unsigned char *dst = destination;
3020 unsigned char *dst_end = destination + dst_bytes;
3021 /* SRC_BASE remembers the start position in source in each loop.
3022 The loop will be exited when there's not enough source text to
3023 analyze multi-byte codes (within macro ONE_MORE_CHAR), or when
3024 there's not enough destination area to produce encoded codes
3025 (within macro EMIT_BYTES). */
3026 unsigned char *src_base;
3027 Lisp_Object translation_table;
3029 if (NILP (Venable_character_translation))
3030 translation_table = Qnil;
3031 else
3033 translation_table = coding->translation_table_for_encode;
3034 if (NILP (translation_table))
3035 translation_table = Vstandard_translation_table_for_encode;
3038 while (1)
3040 int c, charset, c1, c2;
3042 src_base = src;
3043 ONE_MORE_CHAR (c);
3045 /* Now encode the character C. */
3046 if (SINGLE_BYTE_CHAR_P (c))
3048 switch (c)
3050 case '\r':
3051 if (!(coding->mode & CODING_MODE_SELECTIVE_DISPLAY))
3053 EMIT_ONE_BYTE (c);
3054 break;
3056 c = '\n';
3057 case '\n':
3058 if (coding->eol_type == CODING_EOL_CRLF)
3060 EMIT_TWO_BYTES ('\r', c);
3061 break;
3063 else if (coding->eol_type == CODING_EOL_CR)
3064 c = '\r';
3065 default:
3066 EMIT_ONE_BYTE (c);
3069 else
3071 SPLIT_CHAR (c, charset, c1, c2);
3072 if (sjis_p)
3074 if (charset == charset_jisx0208
3075 || charset == charset_jisx0208_1978)
3077 ENCODE_SJIS (c1, c2, c1, c2);
3078 EMIT_TWO_BYTES (c1, c2);
3080 else if (charset == charset_katakana_jisx0201)
3081 EMIT_ONE_BYTE (c1 | 0x80);
3082 else if (charset == charset_latin_jisx0201)
3083 EMIT_ONE_BYTE (c1);
3084 else
3085 /* There's no way other than producing the internal
3086 codes as is. */
3087 EMIT_BYTES (src_base, src);
3089 else
3091 if (charset == charset_big5_1 || charset == charset_big5_2)
3093 ENCODE_BIG5 (charset, c1, c2, c1, c2);
3094 EMIT_TWO_BYTES (c1, c2);
3096 else
3097 /* There's no way other than producing the internal
3098 codes as is. */
3099 EMIT_BYTES (src_base, src);
3102 coding->consumed_char++;
3105 label_end_of_loop:
3106 coding->consumed = src_base - source;
3107 coding->produced = coding->produced_char = dst - destination;
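/* Illustration only, not part of coding.c: a minimal standalone sketch
   of the usual JIS X 0208 to Shift-JIS byte arithmetic, which is the
   kind of mapping the SJIS branch above needs.  The helper name
   `jis_to_sjis' and its packed return value are assumptions made for
   this example; the real ENCODE_SJIS macro is defined elsewhere and may
   be written differently.  */

```c
/* Convert a JIS X 0208 code point (J1 and J2 each in 0x21..0x7E) to
   Shift-JIS.  The result packs the lead byte in bits 8..15 and the
   trail byte in bits 0..7.  Hypothetical helper for illustration.  */
int
jis_to_sjis (int j1, int j2)
{
  /* Two consecutive JIS rows share one Shift-JIS lead byte.  */
  int s1 = ((j1 + 1) >> 1) + (j1 < 0x5F ? 0x70 : 0xB0);
  int s2;

  if (j1 & 1)
    /* Odd row: trail bytes 0x40..0x9E, skipping 0x7F.  */
    s2 = j2 + (j2 >= 0x60 ? 0x20 : 0x1F);
  else
    /* Even row: trail bytes 0x9F..0xFC.  */
    s2 = j2 + 0x7E;

  return (s1 << 8) | s2;
}
```

/* For example, the JIS code 0x2422 (HIRAGANA LETTER A) maps to the
   Shift-JIS bytes 0x82 0xA0 under this arithmetic.  */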
3111 /*** 5. CCL handlers ***/
/* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
   Check if a text is encoded in a coding system whose encoder and
   decoder are written as CCL programs.  If it is, return
   CODING_CATEGORY_MASK_CCL, else return 0.  */

static int
detect_coding_ccl (src, src_end, multibytep)
     unsigned char *src, *src_end;
     int multibytep;
{
  unsigned char *valid;
  int c;
  /* Dummy for ONE_MORE_BYTE.  */
  struct coding_system dummy_coding;
  struct coding_system *coding = &dummy_coding;

  /* No coding system is assigned to coding-category-ccl.  */
  if (!coding_system_table[CODING_CATEGORY_IDX_CCL])
    return 0;

  valid = coding_system_table[CODING_CATEGORY_IDX_CCL]->spec.ccl.valid_codes;
  while (1)
    {
      ONE_MORE_BYTE_CHECK_MULTIBYTE (c, multibytep);
      if (! valid[c])
        return 0;
    }
 label_end_of_loop:
  return CODING_CATEGORY_MASK_CCL;
}
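/* Illustration only, not part of coding.c: a minimal sketch of
   table-driven detection in the spirit of detect_coding_ccl.  A
   256-entry table marks which byte values are acceptable, and a text
   matches only if every byte is marked.  The names `mark_valid_range'
   and `all_bytes_valid' are hypothetical, and the sketch ignores the
   multibyte handling done by ONE_MORE_BYTE_CHECK_MULTIBYTE.  */

```c
#include <stddef.h>

/* Mark the byte values START..END (inclusive) as valid in VALID.
   Out-of-range or reversed bounds are ignored.  */
void
mark_valid_range (unsigned char valid[256], int start, int end)
{
  if (start >= 0 && start <= end && end < 256)
    while (start <= end)
      valid[start++] = 1;
}

/* Return 1 if every byte of SRC[0..LEN) is marked in VALID, else 0.
   An empty text is accepted.  */
int
all_bytes_valid (const unsigned char *src, size_t len,
                 const unsigned char valid[256])
{
  size_t i;

  for (i = 0; i < len; i++)
    if (!valid[src[i]])
      return 0;
  return 1;
}
```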
3145 /*** 6. End-of-line handlers ***/
3147 /* See the above "GENERAL NOTES on `decode_coding_XXX ()' functions". */
3149 static void
3150 decode_eol (coding, source, destination, src_bytes, dst_bytes)
3151 struct coding_system *coding;
3152 unsigned char *source, *destination;
3153 int src_bytes, dst_bytes;
3155 unsigned char *src = source;
3156 unsigned char *dst = destination;
3157 unsigned char *src_end = src + src_bytes;
3158 unsigned char *dst_end = dst + dst_bytes;
3159 Lisp_Object translation_table;
3160 /* SRC_BASE remembers the start position in source in each loop.
3161 The loop will be exited when there's not enough source code
3162 (within macro ONE_MORE_BYTE), or when there's not enough
3163 destination area to produce a character (within macro
3164 EMIT_CHAR). */
3165 unsigned char *src_base;
3166 int c;
3168 translation_table = Qnil;
3169 switch (coding->eol_type)
3171 case CODING_EOL_CRLF:
3172 while (1)
3174 src_base = src;
3175 ONE_MORE_BYTE (c);
3176 if (c == '\r')
3178 ONE_MORE_BYTE (c);
3179 if (c != '\n')
3181 src--;
3182 c = '\r';
3185 else if (c == '\n'
3186 && (coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL))
3188 coding->result = CODING_FINISH_INCONSISTENT_EOL;
3189 goto label_end_of_loop;
3191 EMIT_CHAR (c);
3193 break;
3195 case CODING_EOL_CR:
3196 while (1)
3198 src_base = src;
3199 ONE_MORE_BYTE (c);
3200 if (c == '\n')
3202 if (coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL)
3204 coding->result = CODING_FINISH_INCONSISTENT_EOL;
3205 goto label_end_of_loop;
3208 else if (c == '\r')
3209 c = '\n';
3210 EMIT_CHAR (c);
3212 break;
3214 default: /* no need for EOL handling */
3215 while (1)
3217 src_base = src;
3218 ONE_MORE_BYTE (c);
3219 EMIT_CHAR (c);
3223 label_end_of_loop:
3224 coding->consumed = coding->consumed_char = src_base - source;
3225 coding->produced = dst - destination;
3226 return;
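/* Illustration only, not part of coding.c: a minimal sketch of the
   CRLF branch of end-of-line decoding on a plain byte buffer.  The
   function name `crlf_to_lf' is hypothetical.  Unlike decode_eol, this
   sketch does not track SRC_BASE, does not stop on an inconsistent
   bare LF, and emits a CR at the very end of the input instead of
   waiting for more source text.  */

```c
#include <stddef.h>

/* Copy SRC[0..N) to DST, turning each CR LF pair into a single LF.
   A CR that is not followed by LF is copied unchanged.  DST must have
   room for N bytes.  Return the number of bytes written.  */
size_t
crlf_to_lf (const unsigned char *src, size_t n, unsigned char *dst)
{
  size_t i = 0, out = 0;

  while (i < n)
    {
      unsigned char c = src[i++];

      if (c == '\r' && i < n && src[i] == '\n')
        {
          /* Collapse the pair and skip the LF we just looked at.  */
          c = '\n';
          i++;
        }
      dst[out++] = c;
    }
  return out;
}
```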
/* See "GENERAL NOTES about `encode_coding_XXX ()' functions".  Encode
   the format of end-of-line according to `coding->eol_type'.  It also
   converts multibyte form 8-bit characters to unibyte if
   CODING->src_multibyte is nonzero.  If `coding->mode &
   CODING_MODE_SELECTIVE_DISPLAY' is nonzero, code '\r' in source text
   also means end-of-line.  */
3236 static void
3237 encode_eol (coding, source, destination, src_bytes, dst_bytes)
3238 struct coding_system *coding;
3239 const unsigned char *source;
3240 unsigned char *destination;
3241 int src_bytes, dst_bytes;
3243 const unsigned char *src = source;
3244 unsigned char *dst = destination;
3245 const unsigned char *src_end = src + src_bytes;
3246 unsigned char *dst_end = dst + dst_bytes;
3247 Lisp_Object translation_table;
3248 /* SRC_BASE remembers the start position in source in each loop.
3249 The loop will be exited when there's not enough source text to
3250 analyze multi-byte codes (within macro ONE_MORE_CHAR), or when
3251 there's not enough destination area to produce encoded codes
3252 (within macro EMIT_BYTES). */
3253 const unsigned char *src_base;
3254 unsigned char *tmp;
3255 int c;
3256 int selective_display = coding->mode & CODING_MODE_SELECTIVE_DISPLAY;
3258 translation_table = Qnil;
3259 if (coding->src_multibyte
3260 && *(src_end - 1) == LEADING_CODE_8_BIT_CONTROL)
3262 src_end--;
3263 src_bytes--;
3264 coding->result = CODING_FINISH_INSUFFICIENT_SRC;
3267 if (coding->eol_type == CODING_EOL_CRLF)
3269 while (src < src_end)
3271 src_base = src;
3272 c = *src++;
3273 if (c >= 0x20)
3274 EMIT_ONE_BYTE (c);
3275 else if (c == '\n' || (c == '\r' && selective_display))
3276 EMIT_TWO_BYTES ('\r', '\n');
3277 else
3278 EMIT_ONE_BYTE (c);
3280 src_base = src;
3281 label_end_of_loop:
3284 else
3286 if (!dst_bytes || src_bytes <= dst_bytes)
3288 safe_bcopy (src, dst, src_bytes);
3289 src_base = src_end;
3290 dst += src_bytes;
3292 else
3294 if (coding->src_multibyte
3295 && *(src + dst_bytes - 1) == LEADING_CODE_8_BIT_CONTROL)
3296 dst_bytes--;
3297 safe_bcopy (src, dst, dst_bytes);
3298 src_base = src + dst_bytes;
3299 dst = destination + dst_bytes;
3300 coding->result = CODING_FINISH_INSUFFICIENT_DST;
3302 if (coding->eol_type == CODING_EOL_CR)
3304 for (tmp = destination; tmp < dst; tmp++)
3305 if (*tmp == '\n') *tmp = '\r';
3307 else if (selective_display)
3309 for (tmp = destination; tmp < dst; tmp++)
3310 if (*tmp == '\r') *tmp = '\n';
3313 if (coding->src_multibyte)
3314 dst = destination + str_as_unibyte (destination, dst - destination);
3316 coding->consumed = src_base - source;
3317 coding->produced = dst - destination;
3318 coding->produced_char = coding->produced;
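/* Illustration only, not part of coding.c: a minimal sketch of the
   CRLF branch of end-of-line encoding with a bounded destination.  It
   reports both how much source was consumed and how much output was
   produced, similar in spirit to coding->consumed and
   coding->produced.  The name `lf_to_crlf' and its parameters are
   hypothetical, and selective display is not handled.  */

```c
#include <stddef.h>

/* Copy SRC[0..N) to DST, writing CR LF for each LF.  Stop early when
   DST (capacity CAP) cannot hold the next output unit.  Store the
   number of source bytes consumed in *CONSUMED and return the number
   of bytes produced.  */
size_t
lf_to_crlf (const unsigned char *src, size_t n,
            unsigned char *dst, size_t cap, size_t *consumed)
{
  size_t i = 0, out = 0;

  while (i < n)
    {
      size_t need = (src[i] == '\n') ? 2 : 1;

      if (cap - out < need)
        break;                  /* Not enough room; the caller may retry.  */
      if (need == 2)
        dst[out++] = '\r';
      dst[out++] = src[i++];
    }
  *consumed = i;
  return out;
}
```

/* Stopping before a partially written CR LF keeps the consumed and
   produced counts consistent, so a caller can resume from
   SRC + *CONSUMED with a fresh destination buffer.  */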
3322 /*** 7. C library functions ***/
/* In Emacs Lisp, a coding system is represented by a Lisp symbol which
   has a property `coding-system'.  The value of this property is a
   vector of length 5 (called the coding-vector).  Among elements of
   this vector, the first (element[0]) and the fifth (element[4])
   carry important information for decoding/encoding.  Before
   decoding/encoding, this information should be set in fields of a
   structure of type `coding_system'.

   The value of the property `coding-system' can also be the symbol of
   another subsidiary coding-system.  In that case, Emacs gets the
   coding-vector from that symbol.

   `element[0]' contains information to be set in `coding->type'.  The
   values and their meanings are as follows:

   0 -- coding_type_emacs_mule
   1 -- coding_type_sjis
   2 -- coding_type_iso2022
   3 -- coding_type_big5
   4 -- coding_type_ccl        encoder/decoder written in CCL
   5 -- coding_type_raw_text
   nil -- coding_type_no_conversion
   t -- coding_type_undecided  (automatic conversion on decoding,
                                no-conversion on encoding)

   `element[4]' contains information to be set in `coding->flags' and
   `coding->spec'.  The meaning varies by `coding->type'.

   If `coding->type' is `coding_type_iso2022', element[4] is a vector
   of length 32 (of which the first 17 sub-elements are used now).
   The meanings of these sub-elements are:

   sub-element[N] where N is 0 through 3: to be set in `coding->spec.iso2022'
        If the value is an integer of a valid charset, the charset is
        assumed to be designated to graphic register N initially.

        If the value is negative, it is the negated value of a charset
        which reserves graphic register N, which means that the
        charset is not designated initially but should be designated
        to graphic register N just before encoding a character in
        that charset.

        If the value is nil, graphic register N is never used on
        encoding.

   sub-element[N] where N is 4 through 16: to be set in `coding->flags'
        Each value is t or nil.  See the section ISO2022 of
        `coding.h' for more information.

   If `coding->type' is `coding_type_big5', element[4] is t to denote
   BIG5-ETen or nil to denote BIG5-HKU.

   If `coding->type' has any other value, element[4] is ignored.

   Emacs Lisp's coding systems also carry information about the format
   of end-of-line in the value of the property `eol-type'.  If the
   value is an integer, 0 means CODING_EOL_LF, 1 means CODING_EOL_CRLF,
   and 2 means CODING_EOL_CR.  If it is not an integer, it should be a
   vector of subsidiary coding systems whose property `eol-type' has
   one of the above values.

*/
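/* Illustration only, not part of coding.c: a minimal sketch of how the
   `eol-type' property described above could be reduced to an
   end-of-line kind.  The enum `eol_kind' and the function
   `eol_kind_from_property' are hypothetical stand-ins; the real
   CODING_EOL_* constants live in coding.h and their numeric values
   are not assumed here.  As in setup_coding_system, any value other
   than 1 or 2 is treated as LF.  */

```c
/* Hypothetical stand-ins for the CODING_EOL_* constants.  */
enum eol_kind { EOL_LF, EOL_CRLF, EOL_CR, EOL_UNDECIDED };

/* Map an `eol-type' property to an eol_kind.  IS_VECTOR is nonzero
   when the property holds a vector of subsidiary coding systems, in
   which case the end-of-line format still has to be detected.
   Otherwise VALUE is the integer stored in the property.  */
enum eol_kind
eol_kind_from_property (int is_vector, int value)
{
  if (is_vector)
    return EOL_UNDECIDED;
  switch (value)
    {
    case 1:
      return EOL_CRLF;
    case 2:
      return EOL_CR;
    default:
      return EOL_LF;
    }
}
```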
/* Extract information for decoding/encoding from CODING_SYSTEM and
   set it in CODING.  If CODING_SYSTEM is invalid, CODING is set up so
   that no conversion is performed and -1 is returned; otherwise 0 is
   returned.  */

int
setup_coding_system (coding_system, coding)
3392 Lisp_Object coding_system;
3393 struct coding_system *coding;
3395 Lisp_Object coding_spec, coding_type, eol_type, plist;
3396 Lisp_Object val;
3398 /* At first, zero clear all members. */
3399 bzero (coding, sizeof (struct coding_system));
3401 /* Initialize some fields required for all kinds of coding systems. */
3402 coding->symbol = coding_system;
3403 coding->heading_ascii = -1;
3404 coding->post_read_conversion = coding->pre_write_conversion = Qnil;
3405 coding->composing = COMPOSITION_DISABLED;
3406 coding->cmp_data = NULL;
3408 if (NILP (coding_system))
3409 goto label_invalid_coding_system;
3411 coding_spec = Fget (coding_system, Qcoding_system);
3413 if (!VECTORP (coding_spec)
3414 || XVECTOR (coding_spec)->size != 5
3415 || !CONSP (XVECTOR (coding_spec)->contents[3]))
3416 goto label_invalid_coding_system;
3418 eol_type = inhibit_eol_conversion ? Qnil : Fget (coding_system, Qeol_type);
3419 if (VECTORP (eol_type))
3421 coding->eol_type = CODING_EOL_UNDECIDED;
3422 coding->common_flags = CODING_REQUIRE_DETECTION_MASK;
3424 else if (XFASTINT (eol_type) == 1)
3426 coding->eol_type = CODING_EOL_CRLF;
3427 coding->common_flags
3428 = CODING_REQUIRE_DECODING_MASK | CODING_REQUIRE_ENCODING_MASK;
3430 else if (XFASTINT (eol_type) == 2)
3432 coding->eol_type = CODING_EOL_CR;
3433 coding->common_flags
3434 = CODING_REQUIRE_DECODING_MASK | CODING_REQUIRE_ENCODING_MASK;
3436 else
3437 coding->eol_type = CODING_EOL_LF;
3439 coding_type = XVECTOR (coding_spec)->contents[0];
3440 /* Try short cut. */
3441 if (SYMBOLP (coding_type))
3443 if (EQ (coding_type, Qt))
3445 coding->type = coding_type_undecided;
3446 coding->common_flags |= CODING_REQUIRE_DETECTION_MASK;
3448 else
3449 coding->type = coding_type_no_conversion;
/* Initialize this member.  Anything other than
   CODING_CATEGORY_IDX_UTF_16_BE and
   CODING_CATEGORY_IDX_UTF_16_LE is OK because those two have
   special treatment in detect_eol.  */
3454 coding->category_idx = CODING_CATEGORY_IDX_EMACS_MULE;
3456 return 0;
3459 /* Get values of coding system properties:
3460 `post-read-conversion', `pre-write-conversion',
3461 `translation-table-for-decode', `translation-table-for-encode'. */
3462 plist = XVECTOR (coding_spec)->contents[3];
/* Pre & post conversion functions should be disabled if
   inhibit_pre_post_conversion is nonzero.  This is the case when a
   code conversion function is called while those functions are
   running.  */
3466 if (! inhibit_pre_post_conversion)
3468 coding->post_read_conversion = Fplist_get (plist, Qpost_read_conversion);
3469 coding->pre_write_conversion = Fplist_get (plist, Qpre_write_conversion);
3471 val = Fplist_get (plist, Qtranslation_table_for_decode);
3472 if (SYMBOLP (val))
3473 val = Fget (val, Qtranslation_table_for_decode);
3474 coding->translation_table_for_decode = CHAR_TABLE_P (val) ? val : Qnil;
3475 val = Fplist_get (plist, Qtranslation_table_for_encode);
3476 if (SYMBOLP (val))
3477 val = Fget (val, Qtranslation_table_for_encode);
3478 coding->translation_table_for_encode = CHAR_TABLE_P (val) ? val : Qnil;
3479 val = Fplist_get (plist, Qcoding_category);
3480 if (!NILP (val))
3482 val = Fget (val, Qcoding_category_index);
3483 if (INTEGERP (val))
3484 coding->category_idx = XINT (val);
3485 else
3486 goto label_invalid_coding_system;
3488 else
3489 goto label_invalid_coding_system;
3491 /* If the coding system has non-nil `composition' property, enable
3492 composition handling. */
3493 val = Fplist_get (plist, Qcomposition);
3494 if (!NILP (val))
3495 coding->composing = COMPOSITION_NO;
3497 switch (XFASTINT (coding_type))
3499 case 0:
3500 coding->type = coding_type_emacs_mule;
3501 coding->common_flags
3502 |= CODING_REQUIRE_DECODING_MASK | CODING_REQUIRE_ENCODING_MASK;
3503 if (!NILP (coding->post_read_conversion))
3504 coding->common_flags |= CODING_REQUIRE_DECODING_MASK;
3505 if (!NILP (coding->pre_write_conversion))
3506 coding->common_flags |= CODING_REQUIRE_ENCODING_MASK;
3507 break;
3509 case 1:
3510 coding->type = coding_type_sjis;
3511 coding->common_flags
3512 |= CODING_REQUIRE_DECODING_MASK | CODING_REQUIRE_ENCODING_MASK;
3513 break;
3515 case 2:
3516 coding->type = coding_type_iso2022;
3517 coding->common_flags
3518 |= CODING_REQUIRE_DECODING_MASK | CODING_REQUIRE_ENCODING_MASK;
3520 Lisp_Object val, temp;
3521 Lisp_Object *flags;
3522 int i, charset, reg_bits = 0;
3524 val = XVECTOR (coding_spec)->contents[4];
3526 if (!VECTORP (val) || XVECTOR (val)->size != 32)
3527 goto label_invalid_coding_system;
3529 flags = XVECTOR (val)->contents;
3530 coding->flags
3531 = ((NILP (flags[4]) ? 0 : CODING_FLAG_ISO_SHORT_FORM)
3532 | (NILP (flags[5]) ? 0 : CODING_FLAG_ISO_RESET_AT_EOL)
3533 | (NILP (flags[6]) ? 0 : CODING_FLAG_ISO_RESET_AT_CNTL)
3534 | (NILP (flags[7]) ? 0 : CODING_FLAG_ISO_SEVEN_BITS)
3535 | (NILP (flags[8]) ? 0 : CODING_FLAG_ISO_LOCKING_SHIFT)
3536 | (NILP (flags[9]) ? 0 : CODING_FLAG_ISO_SINGLE_SHIFT)
3537 | (NILP (flags[10]) ? 0 : CODING_FLAG_ISO_USE_ROMAN)
3538 | (NILP (flags[11]) ? 0 : CODING_FLAG_ISO_USE_OLDJIS)
3539 | (NILP (flags[12]) ? 0 : CODING_FLAG_ISO_NO_DIRECTION)
3540 | (NILP (flags[13]) ? 0 : CODING_FLAG_ISO_INIT_AT_BOL)
3541 | (NILP (flags[14]) ? 0 : CODING_FLAG_ISO_DESIGNATE_AT_BOL)
3542 | (NILP (flags[15]) ? 0 : CODING_FLAG_ISO_SAFE)
3543 | (NILP (flags[16]) ? 0 : CODING_FLAG_ISO_LATIN_EXTRA)
3546 /* Invoke graphic register 0 to plane 0. */
3547 CODING_SPEC_ISO_INVOCATION (coding, 0) = 0;
3548 /* Invoke graphic register 1 to plane 1 if we can use full 8-bit. */
3549 CODING_SPEC_ISO_INVOCATION (coding, 1)
3550 = (coding->flags & CODING_FLAG_ISO_SEVEN_BITS ? -1 : 1);
3551 /* Not single shifting at first. */
3552 CODING_SPEC_ISO_SINGLE_SHIFTING (coding) = 0;
3553 /* Beginning of buffer should also be regarded as bol. */
3554 CODING_SPEC_ISO_BOL (coding) = 1;
3556 for (charset = 0; charset <= MAX_CHARSET; charset++)
3557 CODING_SPEC_ISO_REVISION_NUMBER (coding, charset) = 255;
3558 val = Vcharset_revision_alist;
3559 while (CONSP (val))
3561 charset = get_charset_id (Fcar_safe (XCAR (val)));
3562 if (charset >= 0
3563 && (temp = Fcdr_safe (XCAR (val)), INTEGERP (temp))
3564 && (i = XINT (temp), (i >= 0 && (i + '@') < 128)))
3565 CODING_SPEC_ISO_REVISION_NUMBER (coding, charset) = i;
3566 val = XCDR (val);
/* Check FLAGS[REG] (REG = 0, 1, 2, 3) and decide designations.
   FLAGS[REG] can be one of the following:
     integer CHARSET: CHARSET occupies register REG,
     t: designate nothing to REG initially, but REG can be used
        by any charset,
     list of integer, nil, or t: designate the first
        element (if integer) to REG initially, the remaining
        elements (if integer) are designated to REG on request,
        if an element is t, REG can be used by any charset,
     nil: REG is never used.  */
3579 for (charset = 0; charset <= MAX_CHARSET; charset++)
3580 CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset)
3581 = CODING_SPEC_ISO_NO_REQUESTED_DESIGNATION;
3582 for (i = 0; i < 4; i++)
3584 if ((INTEGERP (flags[i])
3585 && (charset = XINT (flags[i]), CHARSET_VALID_P (charset)))
3586 || (charset = get_charset_id (flags[i])) >= 0)
3588 CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, i) = charset;
3589 CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset) = i;
3591 else if (EQ (flags[i], Qt))
3593 CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, i) = -1;
3594 reg_bits |= 1 << i;
3595 coding->flags |= CODING_FLAG_ISO_DESIGNATION;
3597 else if (CONSP (flags[i]))
3599 Lisp_Object tail;
3600 tail = flags[i];
3602 coding->flags |= CODING_FLAG_ISO_DESIGNATION;
3603 if ((INTEGERP (XCAR (tail))
3604 && (charset = XINT (XCAR (tail)),
3605 CHARSET_VALID_P (charset)))
3606 || (charset = get_charset_id (XCAR (tail))) >= 0)
3608 CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, i) = charset;
3609 CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset) =i;
3611 else
3612 CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, i) = -1;
3613 tail = XCDR (tail);
3614 while (CONSP (tail))
3616 if ((INTEGERP (XCAR (tail))
3617 && (charset = XINT (XCAR (tail)),
3618 CHARSET_VALID_P (charset)))
3619 || (charset = get_charset_id (XCAR (tail))) >= 0)
3620 CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset)
3621 = i;
3622 else if (EQ (XCAR (tail), Qt))
3623 reg_bits |= 1 << i;
3624 tail = XCDR (tail);
3627 else
3628 CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, i) = -1;
3630 CODING_SPEC_ISO_DESIGNATION (coding, i)
3631 = CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, i);
3634 if (reg_bits && ! (coding->flags & CODING_FLAG_ISO_LOCKING_SHIFT))
3636 /* REG 1 can be used only by locking shift in 7-bit env. */
3637 if (coding->flags & CODING_FLAG_ISO_SEVEN_BITS)
3638 reg_bits &= ~2;
3639 if (! (coding->flags & CODING_FLAG_ISO_SINGLE_SHIFT))
3640 /* Without any shifting, only REG 0 and 1 can be used. */
3641 reg_bits &= 3;
3644 if (reg_bits)
3645 for (charset = 0; charset <= MAX_CHARSET; charset++)
3647 if (CHARSET_DEFINED_P (charset)
3648 && (CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset)
3649 == CODING_SPEC_ISO_NO_REQUESTED_DESIGNATION))
3651 /* There exist some default graphic registers to be
3652 used by CHARSET. */
3654 /* We had better avoid designating a charset of
3655 CHARS96 to REG 0 as far as possible. */
3656 if (CHARSET_CHARS (charset) == 96)
3657 CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset)
3658 = (reg_bits & 2
3659 ? 1 : (reg_bits & 4 ? 2 : (reg_bits & 8 ? 3 : 0)));
3660 else
3661 CODING_SPEC_ISO_REQUESTED_DESIGNATION (coding, charset)
3662 = (reg_bits & 1
3663 ? 0 : (reg_bits & 2 ? 1 : (reg_bits & 4 ? 2 : 3)));
3667 coding->common_flags |= CODING_REQUIRE_FLUSHING_MASK;
3668 coding->spec.iso2022.last_invalid_designation_register = -1;
3669 break;
3671 case 3:
3672 coding->type = coding_type_big5;
3673 coding->common_flags
3674 |= CODING_REQUIRE_DECODING_MASK | CODING_REQUIRE_ENCODING_MASK;
3675 coding->flags
3676 = (NILP (XVECTOR (coding_spec)->contents[4])
3677 ? CODING_FLAG_BIG5_HKU
3678 : CODING_FLAG_BIG5_ETEN);
3679 break;
3681 case 4:
3682 coding->type = coding_type_ccl;
3683 coding->common_flags
3684 |= CODING_REQUIRE_DECODING_MASK | CODING_REQUIRE_ENCODING_MASK;
3686 val = XVECTOR (coding_spec)->contents[4];
3687 if (! CONSP (val)
3688 || setup_ccl_program (&(coding->spec.ccl.decoder),
3689 XCAR (val)) < 0
3690 || setup_ccl_program (&(coding->spec.ccl.encoder),
3691 XCDR (val)) < 0)
3692 goto label_invalid_coding_system;
3694 bzero (coding->spec.ccl.valid_codes, 256);
3695 val = Fplist_get (plist, Qvalid_codes);
3696 if (CONSP (val))
3698 Lisp_Object this;
3700 for (; CONSP (val); val = XCDR (val))
3702 this = XCAR (val);
3703 if (INTEGERP (this)
3704 && XINT (this) >= 0 && XINT (this) < 256)
3705 coding->spec.ccl.valid_codes[XINT (this)] = 1;
3706 else if (CONSP (this)
3707 && INTEGERP (XCAR (this))
3708 && INTEGERP (XCDR (this)))
3710 int start = XINT (XCAR (this));
3711 int end = XINT (XCDR (this));
3713 if (start >= 0 && start <= end && end < 256)
3714 while (start <= end)
3715 coding->spec.ccl.valid_codes[start++] = 1;
3720 coding->common_flags |= CODING_REQUIRE_FLUSHING_MASK;
3721 coding->spec.ccl.cr_carryover = 0;
3722 coding->spec.ccl.eight_bit_carryover[0] = 0;
3723 break;
3725 case 5:
3726 coding->type = coding_type_raw_text;
3727 break;
3729 default:
3730 goto label_invalid_coding_system;
3732 return 0;
3734 label_invalid_coding_system:
3735 coding->type = coding_type_no_conversion;
3736 coding->category_idx = CODING_CATEGORY_IDX_BINARY;
3737 coding->common_flags = 0;
3738 coding->eol_type = CODING_EOL_LF;
3739 coding->pre_write_conversion = coding->post_read_conversion = Qnil;
3740 return -1;
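/* Illustration only, not part of coding.c: a minimal sketch of the
   default register choice made near the end of the ISO2022 case in
   setup_coding_system.  REG_BITS has bit N set when graphic register
   N may be used by any charset.  The function name `default_register'
   and the CHARS96 flag are hypothetical; CHARS96 stands for the test
   CHARSET_CHARS (charset) == 96.  */

```c
/* Pick a default graphic register (0..3) for a charset that has no
   explicitly requested designation.  A 96-character set avoids
   register 0 whenever another register is available.  */
int
default_register (unsigned reg_bits, int chars96)
{
  if (chars96)
    /* Prefer registers 1, 2, 3 in that order; fall back to 0.  */
    return (reg_bits & 2 ? 1 : (reg_bits & 4 ? 2 : (reg_bits & 8 ? 3 : 0)));

  /* Otherwise take the lowest available register, defaulting to 3.  */
  return (reg_bits & 1 ? 0 : (reg_bits & 2 ? 1 : (reg_bits & 4 ? 2 : 3)));
}
```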
/* Free memory blocks allocated for storing composition information.  */

void
coding_free_composition_data (coding)
     struct coding_system *coding;
{
  struct composition_data *cmp_data = coding->cmp_data, *next;

  if (!cmp_data)
    return;
  /* Memory blocks are chained.  At first, rewind to the first, then,
     free blocks one by one.  */
  while (cmp_data->prev)
    cmp_data = cmp_data->prev;
  while (cmp_data)
    {
      next = cmp_data->next;
      xfree (cmp_data);
      cmp_data = next;
    }
  coding->cmp_data = NULL;
}
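/* Illustration only, not part of coding.c: a minimal standalone sketch
   of the rewind-then-free pattern used above, on a hypothetical
   doubly linked `struct block'.  It uses calloc and free from the C
   library in place of Emacs's allocator, and returns a count so the
   behaviour can be checked.  */

```c
#include <stdlib.h>

/* Hypothetical chained block; only the links matter here.  */
struct block
{
  struct block *prev, *next;
};

/* Append a zero-initialized block after TAIL (which may be NULL) and
   return it, or NULL if allocation fails.  */
struct block *
append_block (struct block *tail)
{
  struct block *b = calloc (1, sizeof *b);

  if (b && tail)
    {
      b->prev = tail;
      tail->next = b;
    }
  return b;
}

/* Free every block in the chain that contains B, starting from any
   position.  Return the number of blocks freed.  */
int
free_chain (struct block *b)
{
  int n = 0;

  if (!b)
    return 0;
  /* Rewind to the first block.  */
  while (b->prev)
    b = b->prev;
  /* Save the successor before freeing each block.  */
  while (b)
    {
      struct block *next = b->next;

      free (b);
      b = next;
      n++;
    }
  return n;
}
```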
/* Set the `char_offset' member of all memory blocks pointed to by
   coding->cmp_data to POS.  */

void
coding_adjust_composition_offset (coding, pos)
     struct coding_system *coding;
     int pos;
{
  struct composition_data *cmp_data;

  for (cmp_data = coding->cmp_data; cmp_data; cmp_data = cmp_data->next)
    cmp_data->char_offset = pos;
}
/* Set up raw-text or one of its subsidiaries in the structure
   coding_system CODING according to the eol_type value already set
   in CODING.  CODING should be set up for some coding system in
   advance.  */

void
setup_raw_text_coding_system (coding)
     struct coding_system *coding;
{
  if (coding->type != coding_type_raw_text)
    {
      coding->symbol = Qraw_text;
      coding->type = coding_type_raw_text;
      if (coding->eol_type != CODING_EOL_UNDECIDED)
        {
          Lisp_Object subsidiaries;
          subsidiaries = Fget (Qraw_text, Qeol_type);

          if (VECTORP (subsidiaries)
              && XVECTOR (subsidiaries)->size == 3)
            coding->symbol
              = XVECTOR (subsidiaries)->contents[coding->eol_type];
        }
      setup_coding_system (coding->symbol, coding);
    }
  return;
}
/* Emacs has a mechanism to automatically detect a coding system if it
   is one of Emacs' internal format, ISO2022, SJIS, or BIG5.  But
   it's impossible to distinguish some coding systems accurately
   because they use the same range of codes.  So, at first, coding
   systems are grouped into the following categories:
3814 o coding-category-emacs-mule
3816 The category for a coding system which has the same code range
3817 as Emacs' internal format. Assigned the coding-system (Lisp
3818 symbol) `emacs-mule' by default.
3820 o coding-category-sjis
3822 The category for a coding system which has the same code range
3823 as SJIS. Assigned the coding-system (Lisp
3824 symbol) `japanese-shift-jis' by default.
3826 o coding-category-iso-7
3828 The category for a coding system which has the same code range
3829 as ISO2022 of 7-bit environment. This doesn't use any locking
3830 shift and single shift functions. This can encode/decode all
3831 charsets. Assigned the coding-system (Lisp symbol)
3832 `iso-2022-7bit' by default.
3834 o coding-category-iso-7-tight
3836 Same as coding-category-iso-7 except that this can
3837 encode/decode only the specified charsets.
3839 o coding-category-iso-8-1
3841 The category for a coding system which has the same code range
3842 as ISO2022 of 8-bit environment and graphic plane 1 used only
3843 for DIMENSION1 charset. This doesn't use any locking shift
3844 and single shift functions. Assigned the coding-system (Lisp
3845 symbol) `iso-latin-1' by default.
3847 o coding-category-iso-8-2
3849 The category for a coding system which has the same code range
3850 as ISO2022 of 8-bit environment and graphic plane 1 used only
3851 for DIMENSION2 charset. This doesn't use any locking shift
3852 and single shift functions. Assigned the coding-system (Lisp
3853 symbol) `japanese-iso-8bit' by default.
3855 o coding-category-iso-7-else
3857 The category for a coding system which has the same code range
3858 as ISO2022 of 7-bit environment but uses locking shift or
3859 single shift functions. Assigned the coding-system (Lisp
3860 symbol) `iso-2022-7bit-lock' by default.
3862 o coding-category-iso-8-else
3864 The category for a coding system which has the same code range
3865 as ISO2022 of 8-bit environment but uses locking shift or
3866 single shift functions. Assigned the coding-system (Lisp
3867 symbol) `iso-2022-8bit-ss2' by default.
3869 o coding-category-big5
3871 The category for a coding system which has the same code range
3872 as BIG5. Assigned the coding-system (Lisp symbol)
3873 `cn-big5' by default.
3875 o coding-category-utf-8
3877 The category for a coding system which has the same code range
3878 as UTF-8 (cf. RFC2279). Assigned the coding-system (Lisp
3879 symbol) `utf-8' by default.
3881 o coding-category-utf-16-be
3883 The category for a coding system in which a text has an
3884 Unicode signature (cf. Unicode Standard) in the order of BIG
3885 endian at the head. Assigned the coding-system (Lisp symbol)
3886 `utf-16-be' by default.
3888 o coding-category-utf-16-le
3890 The category for a coding system in which a text has an
3891 Unicode signature (cf. Unicode Standard) in the order of
3892 LITTLE endian at the head. Assigned the coding-system (Lisp
3893 symbol) `utf-16-le' by default.
3895 o coding-category-ccl
3897 The category for a coding system of which encoder/decoder is
3898 written in CCL programs. The default value is nil, i.e., no
3899 coding system is assigned.
3901 o coding-category-binary
3903 The category for a coding system not categorized in any of the
3904 above. Assigned the coding-system (Lisp symbol)
3905 `no-conversion' by default.
   Each of them is a Lisp symbol whose value is the actual
   `coding-system' (also a Lisp symbol) assigned by the user.
   What Emacs actually does is detect a category of coding system and
   then use the `coding-system' assigned to that category.  If Emacs
   can't narrow the choice to a single category, it selects the
   category of the highest priority.  The priorities of categories are
   also specified by the user in the Lisp variable
   `coding-category-list'.

*/
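/* Illustration only, not part of coding.c: a minimal sketch of
   choosing one category by priority when several are possible.  Each
   category is a one-bit mask and the priority table lists the masks
   from highest to lowest priority.  The name `pick_category' is
   hypothetical, and the sketch returns 0 when nothing matches, whereas
   detect_coding_mask falls back to CODING_CATEGORY_MASK_RAW_TEXT.  */

```c
#include <stddef.h>

/* Return the first entry of PRIORITIES[0..N) that shares a bit with
   DETECTED, or 0 when none does.  */
unsigned
pick_category (unsigned detected, const unsigned *priorities, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (detected & priorities[i])
      return priorities[i];
  return 0;
}
```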
3917 static
3918 int ascii_skip_code[256];
/* Detect how a text of length SRC_BYTES pointed to by SOURCE is encoded.
   If it detects possible coding systems, return an integer in which
   appropriate flag bits are set.  Flag bits are defined by macros
   CODING_CATEGORY_MASK_XXX in `coding.h'.  If PRIORITIES is non-NULL,
   it should point to the table `coding_priorities'.  In that case, only
   the flag bit for a coding system of the highest priority is set in
   the returned value.  If MULTIBYTEP is nonzero, 8-bit codes of the
   range 0x80..0x9F are in multibyte form.

   The number of ASCII characters at the head is returned in *SKIP.  */
3931 static int
3932 detect_coding_mask (source, src_bytes, priorities, skip, multibytep)
3933 unsigned char *source;
3934 int src_bytes, *priorities, *skip;
3935 int multibytep;
3937 register unsigned char c;
3938 unsigned char *src = source, *src_end = source + src_bytes;
3939 unsigned int mask, utf16_examined_p, iso2022_examined_p;
3940 int i;
3942 /* At first, skip all ASCII characters and control characters except
3943 for three ISO2022 specific control characters. */
3944 ascii_skip_code[ISO_CODE_SO] = 0;
3945 ascii_skip_code[ISO_CODE_SI] = 0;
3946 ascii_skip_code[ISO_CODE_ESC] = 0;
3948 label_loop_detect_coding:
3949 while (src < src_end && ascii_skip_code[*src]) src++;
3950 *skip = src - source;
3952 if (src >= src_end)
3953 /* We found nothing other than ASCII. There's nothing to do. */
3954 return 0;
3956 c = *src;
3957 /* The text seems to be encoded in some multilingual coding system.
3958 Now, try to find in which coding system the text is encoded. */
3959 if (c < 0x80)
3961 /* i.e. (c == ISO_CODE_ESC || c == ISO_CODE_SI || c == ISO_CODE_SO) */
3962 /* C is an ISO2022 specific control code of C0. */
3963 mask = detect_coding_iso2022 (src, src_end, multibytep);
3964 if (mask == 0)
3966 /* No valid ISO2022 code follows C. Try again. */
3967 src++;
3968 if (c == ISO_CODE_ESC)
3969 ascii_skip_code[ISO_CODE_ESC] = 1;
3970 else
3971 ascii_skip_code[ISO_CODE_SO] = ascii_skip_code[ISO_CODE_SI] = 1;
3972 goto label_loop_detect_coding;
3974 if (priorities)
3976 for (i = 0; i < CODING_CATEGORY_IDX_MAX; i++)
3978 if (mask & priorities[i])
3979 return priorities[i];
3981 return CODING_CATEGORY_MASK_RAW_TEXT;
3984 else
3986 int try;
3988 if (multibytep && c == LEADING_CODE_8_BIT_CONTROL)
3989 c = src[1] - 0x20;
3991 if (c < 0xA0)
3993 /* C is the first byte of SJIS character code,
3994 or a leading-code of Emacs' internal format (emacs-mule),
3995 or the first byte of UTF-16. */
3996 try = (CODING_CATEGORY_MASK_SJIS
3997 | CODING_CATEGORY_MASK_EMACS_MULE
3998 | CODING_CATEGORY_MASK_UTF_16_BE
3999 | CODING_CATEGORY_MASK_UTF_16_LE);
4001 /* Or, if C is a special latin extra code,
4002 or is an ISO2022 specific control code of C1 (SS2 or SS3),
4003 or is an ISO2022 control-sequence-introducer (CSI),
4004 we should also consider the possibility of ISO2022 codings. */
4005 if ((VECTORP (Vlatin_extra_code_table)
4006 && !NILP (XVECTOR (Vlatin_extra_code_table)->contents[c]))
4007 || (c == ISO_CODE_SS2 || c == ISO_CODE_SS3)
4008 || (c == ISO_CODE_CSI
4009 && (src < src_end
4010 && (*src == ']'
4011 || ((*src == '0' || *src == '1' || *src == '2')
4012 && src + 1 < src_end
4013 && src[1] == ']')))))
4014 try |= (CODING_CATEGORY_MASK_ISO_8_ELSE
4015 | CODING_CATEGORY_MASK_ISO_8BIT);
4017 else
4018 /* C is a character of ISO2022 in graphic plane right,
4019 or a SJIS's 1-byte character code (i.e. JISX0201),
4020 or the first byte of BIG5's 2-byte code,
4021 or the first byte of UTF-8/16. */
4022 try = (CODING_CATEGORY_MASK_ISO_8_ELSE
4023 | CODING_CATEGORY_MASK_ISO_8BIT
4024 | CODING_CATEGORY_MASK_SJIS
4025 | CODING_CATEGORY_MASK_BIG5
4026 | CODING_CATEGORY_MASK_UTF_8
4027 | CODING_CATEGORY_MASK_UTF_16_BE
4028 | CODING_CATEGORY_MASK_UTF_16_LE);
4030 /* Or, we may have to consider the possibility of CCL. */
4031 if (coding_system_table[CODING_CATEGORY_IDX_CCL]
4032 && (coding_system_table[CODING_CATEGORY_IDX_CCL]
4033 ->spec.ccl.valid_codes)[c])
4034 try |= CODING_CATEGORY_MASK_CCL;
4036 mask = 0;
4037 utf16_examined_p = iso2022_examined_p = 0;
4038 if (priorities)
4040 for (i = 0; i < CODING_CATEGORY_IDX_MAX; i++)
4042 if (!iso2022_examined_p
4043 && (priorities[i] & try & CODING_CATEGORY_MASK_ISO))
4045 mask |= detect_coding_iso2022 (src, src_end, multibytep);
4046 iso2022_examined_p = 1;
4048 else if (priorities[i] & try & CODING_CATEGORY_MASK_SJIS)
4049 mask |= detect_coding_sjis (src, src_end, multibytep);
4050 else if (priorities[i] & try & CODING_CATEGORY_MASK_UTF_8)
4051 mask |= detect_coding_utf_8 (src, src_end, multibytep);
4052 else if (!utf16_examined_p
4053 && (priorities[i] & try &
4054 CODING_CATEGORY_MASK_UTF_16_BE_LE))
4056 mask |= detect_coding_utf_16 (src, src_end, multibytep);
4057 utf16_examined_p = 1;
4059 else if (priorities[i] & try & CODING_CATEGORY_MASK_BIG5)
4060 mask |= detect_coding_big5 (src, src_end, multibytep);
4061 else if (priorities[i] & try & CODING_CATEGORY_MASK_EMACS_MULE)
4062 mask |= detect_coding_emacs_mule (src, src_end, multibytep);
4063 else if (priorities[i] & try & CODING_CATEGORY_MASK_CCL)
4064 mask |= detect_coding_ccl (src, src_end, multibytep);
4065 else if (priorities[i] & CODING_CATEGORY_MASK_RAW_TEXT)
4066 mask |= CODING_CATEGORY_MASK_RAW_TEXT;
4067 else if (priorities[i] & CODING_CATEGORY_MASK_BINARY)
4068 mask |= CODING_CATEGORY_MASK_BINARY;
4069 if (mask & priorities[i])
4070 return priorities[i];
4072 return CODING_CATEGORY_MASK_RAW_TEXT;
4074 if (try & CODING_CATEGORY_MASK_ISO)
4075 mask |= detect_coding_iso2022 (src, src_end, multibytep);
4076 if (try & CODING_CATEGORY_MASK_SJIS)
4077 mask |= detect_coding_sjis (src, src_end, multibytep);
4078 if (try & CODING_CATEGORY_MASK_BIG5)
4079 mask |= detect_coding_big5 (src, src_end, multibytep);
4080 if (try & CODING_CATEGORY_MASK_UTF_8)
4081 mask |= detect_coding_utf_8 (src, src_end, multibytep);
4082 if (try & CODING_CATEGORY_MASK_UTF_16_BE_LE)
4083 mask |= detect_coding_utf_16 (src, src_end, multibytep);
4084 if (try & CODING_CATEGORY_MASK_EMACS_MULE)
4085 mask |= detect_coding_emacs_mule (src, src_end, multibytep);
4086 if (try & CODING_CATEGORY_MASK_CCL)
4087 mask |= detect_coding_ccl (src, src_end, multibytep);
4089 return (mask | CODING_CATEGORY_MASK_RAW_TEXT | CODING_CATEGORY_MASK_BINARY);
/* Detect how a text of length SRC_BYTES pointed to by SRC is encoded.
   Information about the detected coding system is stored in CODING.  */

void
detect_coding (coding, src, src_bytes)
     struct coding_system *coding;
     const unsigned char *src;
     int src_bytes;
{
  unsigned int idx;
  int skip, mask;
  Lisp_Object val;

  val = Vcoding_category_list;
  mask = detect_coding_mask (src, src_bytes, coding_priorities, &skip,
                             coding->src_multibyte);
  coding->heading_ascii = skip;

  if (!mask) return;

  /* We found a single coding system of the highest priority in MASK.
     Convert its bit to a category index.  */
  idx = 0;
  while (mask && ! (mask & 1)) mask >>= 1, idx++;
  if (! mask)
    idx = CODING_CATEGORY_IDX_RAW_TEXT;

  val = SYMBOL_VALUE (XVECTOR (Vcoding_category_table)->contents[idx]);

  if (coding->eol_type != CODING_EOL_UNDECIDED)
    {
      Lisp_Object tmp;

      tmp = Fget (val, Qeol_type);
      if (VECTORP (tmp))
        val = XVECTOR (tmp)->contents[coding->eol_type];
    }

  /* Set up this new coding system while preserving some slots.  */
  {
    int src_multibyte = coding->src_multibyte;
    int dst_multibyte = coding->dst_multibyte;

    setup_coding_system (val, coding);
    coding->src_multibyte = src_multibyte;
    coding->dst_multibyte = dst_multibyte;
    coding->heading_ascii = skip;
  }
}
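
/* Illustration only, not part of coding.c: a minimal sketch of the
   step in detect_coding that turns a one-bit category mask into a
   category index by scanning for the lowest set bit.  The names
   IDX_*, MASK_OF and lowest_category_index are hypothetical and exist
   only for this example; the real indices are the
   CODING_CATEGORY_IDX_* constants.  */

```c
#include <assert.h>

/* Hypothetical category indices, for illustration only.  */
enum { IDX_EMACS_MULE = 0, IDX_SJIS = 1, IDX_ISO_7 = 2, IDX_RAW_TEXT = 3 };

#define MASK_OF(idx) (1u << (idx))

/* Return the index of the lowest set bit in MASK, or FALLBACK_IDX if
   MASK has no bit set.  */
unsigned int
lowest_category_index (unsigned int mask, unsigned int fallback_idx)
{
  unsigned int idx = 0;

  assert (fallback_idx < 32);   /* must itself be a valid bit index */
  if (mask == 0)
    return fallback_idx;
  while (!(mask & 1))
    mask >>= 1, idx++;
  return idx;
}
```

/* With several bits set, the lowest one wins, so callers that want a
   priority order have to reduce the mask to one bit beforehand.  */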
/* Detect how end-of-line of a text of length SRC_BYTES pointed to by
   SOURCE is encoded.  Return one of CODING_EOL_LF, CODING_EOL_CRLF,
   CODING_EOL_CR, CODING_EOL_UNDECIDED, and CODING_EOL_INCONSISTENT.

   The number of non-eol characters at the head is returned as *SKIP.  */

#define MAX_EOL_CHECK_COUNT 3

static int
detect_eol_type (source, src_bytes, skip)
     unsigned char *source;
     int src_bytes, *skip;
{
  unsigned char *src = source, *src_end = src + src_bytes;
  unsigned char c;
  int total = 0;                /* How many end-of-lines are found so far.  */
  int eol_type = CODING_EOL_UNDECIDED;
  int this_eol_type;

  *skip = 0;

  while (src < src_end && total < MAX_EOL_CHECK_COUNT)
    {
      c = *src++;
      if (c == '\n' || c == '\r')
        {
          if (*skip == 0)
            *skip = src - 1 - source;
          total++;
          if (c == '\n')
            this_eol_type = CODING_EOL_LF;
          else if (src >= src_end || *src != '\n')
            this_eol_type = CODING_EOL_CR;
          else
            this_eol_type = CODING_EOL_CRLF, src++;

          if (eol_type == CODING_EOL_UNDECIDED)
            /* This is the first end-of-line.  */
            eol_type = this_eol_type;
          else if (eol_type != this_eol_type)
            {
              /* This type differs from the one found before.  */
              eol_type = CODING_EOL_INCONSISTENT;
              break;
            }
        }
    }

  if (*skip == 0)
    *skip = src_end - source;
  return eol_type;
}
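
/* Illustration only, not part of coding.c: a self-contained sketch of
   the end-of-line detection technique used by detect_eol_type.  It
   examines at most EOL_CHECK_LIMIT line ends and reports whether they
   agree.  The enum, the limit macro and sketch_detect_eol are
   hypothetical names.  Unlike the function above, the sketch tracks
   the first line end with an explicit flag rather than treating
   *SKIP == 0 as "not found yet"; that is a simplification chosen for
   the example, not a description of the original behavior.  */

```c
#include <assert.h>
#include <stddef.h>

enum eol_kind { EOL_UNDECIDED, EOL_LF, EOL_CRLF, EOL_CR, EOL_INCONSISTENT };

#define EOL_CHECK_LIMIT 3

/* Classify the line ends in TEXT[0..LEN).  Store in *SKIP the number
   of bytes before the first line end, or LEN if there is none.  */
enum eol_kind
sketch_detect_eol (const unsigned char *text, size_t len, size_t *skip)
{
  const unsigned char *p = text, *end = text + len;
  enum eol_kind seen = EOL_UNDECIDED;
  int total = 0, found_first = 0;

  assert (skip != NULL);
  *skip = len;

  while (p < end && total < EOL_CHECK_LIMIT)
    {
      unsigned char c = *p++;
      enum eol_kind this_kind;

      if (c != '\n' && c != '\r')
        continue;
      if (!found_first)
        {
          *skip = (size_t) (p - 1 - text);
          found_first = 1;
        }
      total++;
      if (c == '\n')
        this_kind = EOL_LF;
      else if (p < end && *p == '\n')
        this_kind = EOL_CRLF, p++;      /* consume the LF of CRLF */
      else
        this_kind = EOL_CR;

      if (seen == EOL_UNDECIDED)
        seen = this_kind;               /* first line end */
      else if (seen != this_kind)
        return EOL_INCONSISTENT;
    }
  return seen;
}
```

/* Because only the first few line ends are checked, a mismatch that
   appears later in the text goes unnoticed by this kind of detector.  */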
/* Like detect_eol_type, but detect EOL type in 2-octet
   big-endian/little-endian format for coding systems utf-16-be and
   utf-16-le.  */

static int
detect_eol_type_in_2_octet_form (source, src_bytes, skip, big_endian_p)
     unsigned char *source;
     int src_bytes, *skip, big_endian_p;
{
  unsigned char *src = source, *src_end = src + src_bytes;
  unsigned int c1, c2;
  int total = 0;                /* How many end-of-lines are found so far.  */
  int eol_type = CODING_EOL_UNDECIDED;
  int this_eol_type;
  int msb, lsb;

  if (big_endian_p)
    msb = 0, lsb = 1;
  else
    msb = 1, lsb = 0;

  *skip = 0;

  while ((src + 1) < src_end && total < MAX_EOL_CHECK_COUNT)
    {
      c1 = (src[msb] << 8) | (src[lsb]);
      src += 2;

      if (c1 == '\n' || c1 == '\r')
        {
          if (*skip == 0)
            *skip = src - 2 - source;
          total++;
          if (c1 == '\n')
            {
              this_eol_type = CODING_EOL_LF;
            }
          else
            {
              if ((src + 1) >= src_end)
                {
                  this_eol_type = CODING_EOL_CR;
                }
              else
                {
                  c2 = (src[msb] << 8) | (src[lsb]);
                  if (c2 == '\n')
                    this_eol_type = CODING_EOL_CRLF, src += 2;
                  else
                    this_eol_type = CODING_EOL_CR;
                }
            }

          if (eol_type == CODING_EOL_UNDECIDED)
            /* This is the first end-of-line.  */
            eol_type = this_eol_type;
          else if (eol_type != this_eol_type)
            {
              /* This type differs from the one found before.  */
              eol_type = CODING_EOL_INCONSISTENT;
              break;
            }
        }
    }

  if (*skip == 0)
    *skip = src_end - source;
  return eol_type;
}
/* Detect how end-of-line of a text of length SRC_BYTES pointed to by
   SRC is encoded.  If an appropriate end-of-line format is detected,
   store the information in *CODING.  */

void
detect_eol (coding, src, src_bytes)
     struct coding_system *coding;
     const unsigned char *src;
     int src_bytes;
{
  Lisp_Object val;
  int skip;
  int eol_type;

  switch (coding->category_idx)
    {
    case CODING_CATEGORY_IDX_UTF_16_BE:
      eol_type = detect_eol_type_in_2_octet_form (src, src_bytes, &skip, 1);
      break;
    case CODING_CATEGORY_IDX_UTF_16_LE:
      eol_type = detect_eol_type_in_2_octet_form (src, src_bytes, &skip, 0);
      break;
    default:
      eol_type = detect_eol_type (src, src_bytes, &skip);
      break;
    }

  if (coding->heading_ascii > skip)
    coding->heading_ascii = skip;
  else
    skip = coding->heading_ascii;

  if (eol_type == CODING_EOL_UNDECIDED)
    return;
  if (eol_type == CODING_EOL_INCONSISTENT)
    {
#if 0
      /* This code is suppressed until we find a better way to
         distinguish a raw text file from a binary file.  */

      /* If we have already detected that the coding is raw-text, the
         coding should actually be no-conversion.  */
      if (coding->type == coding_type_raw_text)
        {
          setup_coding_system (Qno_conversion, coding);
          return;
        }
      /* Else, let's decode only text code anyway.  */
#endif /* 0 */
      eol_type = CODING_EOL_LF;
    }

  val = Fget (coding->symbol, Qeol_type);
  if (VECTORP (val) && XVECTOR (val)->size == 3)
    {
      int src_multibyte = coding->src_multibyte;
      int dst_multibyte = coding->dst_multibyte;
      struct composition_data *cmp_data = coding->cmp_data;

      setup_coding_system (XVECTOR (val)->contents[eol_type], coding);
      coding->src_multibyte = src_multibyte;
      coding->dst_multibyte = dst_multibyte;
      coding->heading_ascii = skip;
      coding->cmp_data = cmp_data;
    }
}
#define CONVERSION_BUFFER_EXTRA_ROOM 256

#define DECODING_BUFFER_MAG(coding)                     \
  (coding->type == coding_type_iso2022                  \
   ? 3                                                  \
   : (coding->type == coding_type_ccl                   \
      ? coding->spec.ccl.decoder.buf_magnification      \
      : 2))

/* Return the maximum size (in bytes) of a buffer large enough for
   decoding SRC_BYTES of text encoded in CODING.  */

int
decoding_buffer_size (coding, src_bytes)
     struct coding_system *coding;
     int src_bytes;
{
  return (src_bytes * DECODING_BUFFER_MAG (coding)
          + CONVERSION_BUFFER_EXTRA_ROOM);
}

/* Return the maximum size (in bytes) of a buffer large enough for
   encoding SRC_BYTES of text to CODING.  */

int
encoding_buffer_size (coding, src_bytes)
     struct coding_system *coding;
     int src_bytes;
{
  int magnification;

  if (coding->type == coding_type_ccl)
    magnification = coding->spec.ccl.encoder.buf_magnification;
  else if (CODING_REQUIRE_ENCODING (coding))
    magnification = 3;
  else
    magnification = 1;

  return (src_bytes * magnification + CONVERSION_BUFFER_EXTRA_ROOM);
}
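
/* Illustration only, not part of coding.c: a small sketch of the
   sizing rule used by decoding_buffer_size, namely source length
   times a per-type magnification plus a fixed amount of extra room.
   The struct, enum and function names are hypothetical stand-ins for
   struct coding_system and its fields.  */

```c
#include <assert.h>

#define EXTRA_ROOM 256

enum sketch_type { SKETCH_ISO2022, SKETCH_CCL, SKETCH_OTHER };

struct sketch_coding
{
  enum sketch_type type;
  int ccl_magnification;        /* used only when TYPE is SKETCH_CCL */
};

/* Return an upper bound on the bytes needed to decode SRC_BYTES.  */
int
sketch_decoding_buffer_size (const struct sketch_coding *coding, int src_bytes)
{
  int mag;

  assert (coding != 0 && src_bytes >= 0);
  mag = (coding->type == SKETCH_ISO2022 ? 3
         : coding->type == SKETCH_CCL ? coding->ccl_magnification
         : 2);
  return src_bytes * mag + EXTRA_ROOM;
}
```

/* For example, 100 source bytes of the ISO2022 kind give
   100 * 3 + 256 = 556 bytes.  The sketch does not guard against
   integer overflow for very large inputs.  */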
/* Working buffer for code conversion.  */
struct conversion_buffer
{
  int size;                     /* size of data.  */
  int on_stack;                 /* 1 if allocated by alloca.  */
  unsigned char *data;
};

/* Don't use alloca for allocating memory space larger than this, lest
   we overflow the stack.  */
#define MAX_ALLOCA 16*1024

/* Allocate LEN bytes of memory for BUF (struct conversion_buffer).  */
#define allocate_conversion_buffer(buf, len)            \
  do {                                                  \
    if (len < MAX_ALLOCA)                               \
      {                                                 \
        buf.data = (unsigned char *) alloca (len);      \
        buf.on_stack = 1;                               \
      }                                                 \
    else                                                \
      {                                                 \
        buf.data = (unsigned char *) xmalloc (len);     \
        buf.on_stack = 0;                               \
      }                                                 \
    buf.size = len;                                     \
  } while (0)

/* Double the allocated memory for *BUF.  */
static void
extend_conversion_buffer (buf)
     struct conversion_buffer *buf;
{
  if (buf->on_stack)
    {
      /* Stack memory cannot be resized; move the data to the heap.  */
      unsigned char *save = buf->data;
      buf->data = (unsigned char *) xmalloc (buf->size * 2);
      bcopy (save, buf->data, buf->size);
      buf->on_stack = 0;
    }
  else
    {
      buf->data = (unsigned char *) xrealloc (buf->data, buf->size * 2);
    }
  buf->size *= 2;
}

/* Free the allocated memory for BUF if it is not on the stack.  */
static void
free_conversion_buffer (buf)
     struct conversion_buffer *buf;
{
  if (!buf->on_stack)
    xfree (buf->data);
}
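
/* Illustration only, not part of coding.c: a sketch of the
   stack-then-heap growth pattern used by the conversion buffer above,
   written with standard malloc/realloc/free in place of the
   xmalloc/xrealloc/xfree wrappers, and with caller-owned storage
   standing in for alloca.  struct sketch_buffer, sketch_extend and
   sketch_free are hypothetical names.  Unlike the functions above,
   the sketch reports allocation failure to the caller.  */

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sketch_buffer
{
  size_t size;                  /* capacity of DATA in bytes */
  int on_stack;                 /* nonzero if DATA is caller-owned storage */
  unsigned char *data;
};

/* Double BUF's capacity, moving off caller-owned storage if needed.
   Return 0 on success, -1 if allocation fails (BUF is unchanged).  */
int
sketch_extend (struct sketch_buffer *buf)
{
  unsigned char *grown;

  assert (buf != NULL && buf->size > 0);
  if (buf->on_stack)
    {
      /* Caller-owned memory cannot be resized: copy it to the heap.  */
      grown = malloc (buf->size * 2);
      if (grown == NULL)
        return -1;
      memcpy (grown, buf->data, buf->size);
      buf->on_stack = 0;
    }
  else
    {
      grown = realloc (buf->data, buf->size * 2);
      if (grown == NULL)
        return -1;
    }
  buf->data = grown;
  buf->size *= 2;
  return 0;
}

/* Release BUF's memory unless it still lives in caller-owned storage.  */
void
sketch_free (struct sketch_buffer *buf)
{
  if (!buf->on_stack)
    free (buf->data);
}
```

/* The ON_STACK flag is what keeps sketch_free from passing
   caller-owned memory to free.  */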
int
ccl_coding_driver (coding, source, destination, src_bytes, dst_bytes, encodep)
     struct coding_system *coding;
     unsigned char *source, *destination;
     int src_bytes, dst_bytes, encodep;
4434 struct ccl_program *ccl
4435 = encodep ? &coding->spec.ccl.encoder : &coding->spec.ccl.decoder;
4436 unsigned char *dst = destination;
4438 ccl->suppress_error = coding->suppress_error;
4439 ccl->last_block = coding->mode & CODING_MODE_LAST_BLOCK;
4440 if (encodep)
4442 /* On encoding, EOL format is converted within ccl_driver. For
4443 that, setup proper information in the structure CCL. */
4444 ccl->eol_type = coding->eol_type;
4445 if (ccl->eol_type ==CODING_EOL_UNDECIDED)
4446 ccl->eol_type = CODING_EOL_LF;
4447 ccl->cr_consumed = coding->spec.ccl.cr_carryover;
4449 ccl->multibyte = coding->src_multibyte;
4450 if (coding->spec.ccl.eight_bit_carryover[0] != 0)
4452 /* Move carryover bytes to DESTINATION. */
4453 unsigned char *p = coding->spec.ccl.eight_bit_carryover;
4454 while (*p)
4455 *dst++ = *p++;
4456 coding->spec.ccl.eight_bit_carryover[0] = 0;
4457 if (dst_bytes)
4458 dst_bytes -= dst - destination;
4461 coding->produced = (ccl_driver (ccl, source, dst, src_bytes, dst_bytes,
4462 &(coding->consumed))
4463 + dst - destination);
4465 if (encodep)
4467 coding->produced_char = coding->produced;
4468 coding->spec.ccl.cr_carryover = ccl->cr_consumed;
4470 else if (!ccl->eight_bit_control)
      /* The produced bytes form a valid multibyte sequence.  */
4473 coding->produced_char
4474 = multibyte_chars_in_text (destination, coding->produced);
4475 coding->spec.ccl.eight_bit_carryover[0] = 0;
4477 else
      /* On decoding, the destination should always be multibyte.
         But the CCL program might have generated an invalid
         multibyte sequence.  Here we make such a sequence valid as
         multibyte.  */
4483 int bytes
4484 = dst_bytes ? dst_bytes : source + coding->consumed - destination;
4486 if ((coding->consumed < src_bytes
4487 || !ccl->last_block)
4488 && coding->produced >= 1
4489 && destination[coding->produced - 1] >= 0x80)
          /* We should not convert the trailing 8-bit codes to
             multibyte form even if they don't form a valid
             multibyte sequence.  They may form a valid sequence in
             the next call.  */
4495 int carryover = 0;
4497 if (destination[coding->produced - 1] < 0xA0)
4498 carryover = 1;
4499 else if (coding->produced >= 2)
4501 if (destination[coding->produced - 2] >= 0x80)
4503 if (destination[coding->produced - 2] < 0xA0)
4504 carryover = 2;
4505 else if (coding->produced >= 3
4506 && destination[coding->produced - 3] >= 0x80
4507 && destination[coding->produced - 3] < 0xA0)
4508 carryover = 3;
4511 if (carryover > 0)
4513 BCOPY_SHORT (destination + coding->produced - carryover,
4514 coding->spec.ccl.eight_bit_carryover,
4515 carryover);
4516 coding->spec.ccl.eight_bit_carryover[carryover] = 0;
4517 coding->produced -= carryover;
4520 coding->produced = str_as_multibyte (destination, bytes,
4521 coding->produced,
4522 &(coding->produced_char));
4525 switch (ccl->status)
4527 case CCL_STAT_SUSPEND_BY_SRC:
4528 coding->result = CODING_FINISH_INSUFFICIENT_SRC;
4529 break;
4530 case CCL_STAT_SUSPEND_BY_DST:
4531 coding->result = CODING_FINISH_INSUFFICIENT_DST;
4532 break;
4533 case CCL_STAT_QUIT:
4534 case CCL_STAT_INVALID_CMD:
4535 coding->result = CODING_FINISH_INTERRUPT;
4536 break;
4537 default:
4538 coding->result = CODING_FINISH_NORMAL;
4539 break;
4541 return coding->result;
4544 /* Decode EOL format of the text at PTR of BYTES length destructively
4545 according to CODING->eol_type. This is called after the CCL
4546 program produced a decoded text at PTR. If we do CRLF->LF
4547 conversion, update CODING->produced and CODING->produced_char. */
4549 static void
4550 decode_eol_post_ccl (coding, ptr, bytes)
4551 struct coding_system *coding;
4552 unsigned char *ptr;
4553 int bytes;
4555 Lisp_Object val, saved_coding_symbol;
4556 unsigned char *pend = ptr + bytes;
4557 int dummy;
4559 /* Remember the current coding system symbol. We set it back when
4560 an inconsistent EOL is found so that `last-coding-system-used' is
4561 set to the coding system that doesn't specify EOL conversion. */
4562 saved_coding_symbol = coding->symbol;
4564 coding->spec.ccl.cr_carryover = 0;
4565 if (coding->eol_type == CODING_EOL_UNDECIDED)
4567 /* Here, to avoid the call of setup_coding_system, we directly
4568 call detect_eol_type. */
4569 coding->eol_type = detect_eol_type (ptr, bytes, &dummy);
4570 if (coding->eol_type == CODING_EOL_INCONSISTENT)
4571 coding->eol_type = CODING_EOL_LF;
4572 if (coding->eol_type != CODING_EOL_UNDECIDED)
4574 val = Fget (coding->symbol, Qeol_type);
4575 if (VECTORP (val) && XVECTOR (val)->size == 3)
4576 coding->symbol = XVECTOR (val)->contents[coding->eol_type];
4578 coding->mode |= CODING_MODE_INHIBIT_INCONSISTENT_EOL;
4581 if (coding->eol_type == CODING_EOL_LF
4582 || coding->eol_type == CODING_EOL_UNDECIDED)
4584 /* We have nothing to do. */
4585 ptr = pend;
4587 else if (coding->eol_type == CODING_EOL_CRLF)
4589 unsigned char *pstart = ptr, *p = ptr;
4591 if (! (coding->mode & CODING_MODE_LAST_BLOCK)
4592 && *(pend - 1) == '\r')
4594 /* If the last character is CR, we can't handle it here
4595 because LF will be in the not-yet-decoded source text.
4596 Record that the CR is not yet processed. */
4597 coding->spec.ccl.cr_carryover = 1;
4598 coding->produced--;
4599 coding->produced_char--;
4600 pend--;
4602 while (ptr < pend)
4604 if (*ptr == '\r')
4606 if (ptr + 1 < pend && *(ptr + 1) == '\n')
4608 *p++ = '\n';
4609 ptr += 2;
4611 else
4613 if (coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL)
4614 goto undo_eol_conversion;
4615 *p++ = *ptr++;
4618 else if (*ptr == '\n'
4619 && coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL)
4620 goto undo_eol_conversion;
4621 else
4622 *p++ = *ptr++;
4623 continue;
4625 undo_eol_conversion:
          /* We have encountered an inconsistent EOL format at PTR.
             Convert all LFs before PTR back to CRLFs.  */
4628 for (p--, ptr--; p >= pstart; p--)
4630 if (*p == '\n')
4631 *ptr-- = '\n', *ptr-- = '\r';
4632 else
4633 *ptr-- = *p;
4635 /* If carryover is recorded, cancel it because we don't
4636 convert CRLF anymore. */
4637 if (coding->spec.ccl.cr_carryover)
4639 coding->spec.ccl.cr_carryover = 0;
4640 coding->produced++;
4641 coding->produced_char++;
4642 pend++;
4644 p = ptr = pend;
4645 coding->eol_type = CODING_EOL_LF;
4646 coding->symbol = saved_coding_symbol;
4648 if (p < pend)
4650 /* As each two-byte sequence CRLF was converted to LF, (PEND
4651 - P) is the number of deleted characters. */
4652 coding->produced -= pend - p;
4653 coding->produced_char -= pend - p;
4656 else /* i.e. coding->eol_type == CODING_EOL_CR */
4658 unsigned char *p = ptr;
4660 for (; ptr < pend; ptr++)
4662 if (*ptr == '\r')
4663 *ptr = '\n';
4664 else if (*ptr == '\n'
4665 && coding->mode & CODING_MODE_INHIBIT_INCONSISTENT_EOL)
4667 for (; p < ptr; p++)
4669 if (*p == '\n')
4670 *p = '\r';
4672 ptr = pend;
4673 coding->eol_type = CODING_EOL_LF;
4674 coding->symbol = saved_coding_symbol;
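
/* Illustration only, not part of coding.c: a sketch of the in-place
   CRLF -> LF rewrite at the heart of decode_eol_post_ccl, using a read
   index and a write index over the same buffer.  sketch_crlf_to_lf is
   a hypothetical name.  The sketch deliberately leaves out two things
   the real function does: carrying a trailing CR over to the next
   block, and undoing the conversion when an inconsistent line end is
   found.  */

```c
#include <assert.h>
#include <stddef.h>

/* Convert each CRLF in TEXT[0..LEN) to LF in place and return the new
   length.  A CR that is not followed by LF is kept as is.  */
size_t
sketch_crlf_to_lf (unsigned char *text, size_t len)
{
  size_t r = 0, w = 0;          /* read and write positions */

  assert (text != NULL || len == 0);
  while (r < len)
    {
      if (text[r] == '\r' && r + 1 < len && text[r + 1] == '\n')
        {
          text[w++] = '\n';
          r += 2;
        }
      else
        text[w++] = text[r++];
    }
  return w;
}
```

/* The difference between the old and new lengths is the number of
   CRLF pairs converted, which is how the produced byte and character
   counts can be adjusted afterwards.  */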
/* See "GENERAL NOTES about `decode_coding_XXX ()' functions".  Before
   decoding, it may detect the coding system and the end-of-line
   format if those are not yet decided.  The source should be unibyte;
   the result is multibyte if CODING->dst_multibyte is nonzero, else
   unibyte.  */

int
decode_coding (coding, source, destination, src_bytes, dst_bytes)
     struct coding_system *coding;
     const unsigned char *source;
     unsigned char *destination;
     int src_bytes, dst_bytes;
4693 int extra = 0;
4695 if (coding->type == coding_type_undecided)
4696 detect_coding (coding, source, src_bytes);
4698 if (coding->eol_type == CODING_EOL_UNDECIDED
4699 && coding->type != coding_type_ccl)
4701 detect_eol (coding, source, src_bytes);
4702 /* We had better recover the original eol format if we
4703 encounter an inconsistent eol format while decoding. */
4704 coding->mode |= CODING_MODE_INHIBIT_INCONSISTENT_EOL;
4707 coding->produced = coding->produced_char = 0;
4708 coding->consumed = coding->consumed_char = 0;
4709 coding->errors = 0;
4710 coding->result = CODING_FINISH_NORMAL;
4712 switch (coding->type)
4714 case coding_type_sjis:
4715 decode_coding_sjis_big5 (coding, source, destination,
4716 src_bytes, dst_bytes, 1);
4717 break;
4719 case coding_type_iso2022:
4720 decode_coding_iso2022 (coding, source, destination,
4721 src_bytes, dst_bytes);
4722 break;
4724 case coding_type_big5:
4725 decode_coding_sjis_big5 (coding, source, destination,
4726 src_bytes, dst_bytes, 0);
4727 break;
4729 case coding_type_emacs_mule:
4730 decode_coding_emacs_mule (coding, source, destination,
4731 src_bytes, dst_bytes);
4732 break;
4734 case coding_type_ccl:
4735 if (coding->spec.ccl.cr_carryover)
4737 /* Put the CR which was not processed by the previous call
4738 of decode_eol_post_ccl in DESTINATION. It will be
4739 decoded together with the following LF by the call to
4740 decode_eol_post_ccl below. */
4741 *destination = '\r';
4742 coding->produced++;
4743 coding->produced_char++;
4744 dst_bytes--;
4745 extra = coding->spec.ccl.cr_carryover;
4747 ccl_coding_driver (coding, source, destination + extra,
4748 src_bytes, dst_bytes, 0);
4749 if (coding->eol_type != CODING_EOL_LF)
4751 coding->produced += extra;
4752 coding->produced_char += extra;
4753 decode_eol_post_ccl (coding, destination, coding->produced);
4755 break;
4757 default:
4758 decode_eol (coding, source, destination, src_bytes, dst_bytes);
4761 if (coding->result == CODING_FINISH_INSUFFICIENT_SRC
4762 && coding->mode & CODING_MODE_LAST_BLOCK
4763 && coding->consumed == src_bytes)
4764 coding->result = CODING_FINISH_NORMAL;
4766 if (coding->mode & CODING_MODE_LAST_BLOCK
4767 && coding->result == CODING_FINISH_INSUFFICIENT_SRC)
4769 const unsigned char *src = source + coding->consumed;
4770 unsigned char *dst = destination + coding->produced;
4772 src_bytes -= coding->consumed;
4773 coding->errors++;
4774 if (COMPOSING_P (coding))
4775 DECODE_COMPOSITION_END ('1');
4776 while (src_bytes--)
4778 int c = *src++;
4779 dst += CHAR_STRING (c, dst);
4780 coding->produced_char++;
4782 coding->consumed = coding->consumed_char = src - source;
4783 coding->produced = dst - destination;
4784 coding->result = CODING_FINISH_NORMAL;
4787 if (!coding->dst_multibyte)
4789 coding->produced = str_as_unibyte (destination, coding->produced);
4790 coding->produced_char = coding->produced;
4793 return coding->result;
/* See "GENERAL NOTES about `encode_coding_XXX ()' functions".  The
   multibyteness of the source is CODING->src_multibyte; the result
   is always unibyte.  */

int
encode_coding (coding, source, destination, src_bytes, dst_bytes)
     struct coding_system *coding;
     const unsigned char *source;
     unsigned char *destination;
     int src_bytes, dst_bytes;
4807 coding->produced = coding->produced_char = 0;
4808 coding->consumed = coding->consumed_char = 0;
4809 coding->errors = 0;
4810 coding->result = CODING_FINISH_NORMAL;
4812 switch (coding->type)
4814 case coding_type_sjis:
4815 encode_coding_sjis_big5 (coding, source, destination,
4816 src_bytes, dst_bytes, 1);
4817 break;
4819 case coding_type_iso2022:
4820 encode_coding_iso2022 (coding, source, destination,
4821 src_bytes, dst_bytes);
4822 break;
4824 case coding_type_big5:
4825 encode_coding_sjis_big5 (coding, source, destination,
4826 src_bytes, dst_bytes, 0);
4827 break;
4829 case coding_type_emacs_mule:
4830 encode_coding_emacs_mule (coding, source, destination,
4831 src_bytes, dst_bytes);
4832 break;
4834 case coding_type_ccl:
4835 ccl_coding_driver (coding, source, destination,
4836 src_bytes, dst_bytes, 1);
4837 break;
4839 default:
4840 encode_eol (coding, source, destination, src_bytes, dst_bytes);
4843 if (coding->mode & CODING_MODE_LAST_BLOCK
4844 && coding->result == CODING_FINISH_INSUFFICIENT_SRC)
4846 const unsigned char *src = source + coding->consumed;
4847 unsigned char *dst = destination + coding->produced;
4849 if (coding->type == coding_type_iso2022)
4850 ENCODE_RESET_PLANE_AND_REGISTER;
4851 if (COMPOSING_P (coding))
4852 *dst++ = ISO_CODE_ESC, *dst++ = '1';
4853 if (coding->consumed < src_bytes)
4855 int len = src_bytes - coding->consumed;
4857 BCOPY_SHORT (src, dst, len);
4858 if (coding->src_multibyte)
4859 len = str_as_unibyte (dst, len);
4860 dst += len;
4861 coding->consumed = src_bytes;
4863 coding->produced = coding->produced_char = dst - destination;
4864 coding->result = CODING_FINISH_NORMAL;
4867 if (coding->result == CODING_FINISH_INSUFFICIENT_SRC
4868 && coding->consumed == src_bytes)
4869 coding->result = CODING_FINISH_NORMAL;
4871 return coding->result;
4874 /* Scan text in the region between *BEG and *END (byte positions),
4875 skip characters which we don't have to decode by coding system
4876 CODING at the head and tail, then set *BEG and *END to the region
4877 of the text we actually have to convert. The caller should move
4878 the gap out of the region in advance if the region is from a
4879 buffer.
4881 If STR is not NULL, *BEG and *END are indices into STR. */
4883 static void
4884 shrink_decoding_region (beg, end, coding, str)
4885 int *beg, *end;
4886 struct coding_system *coding;
4887 unsigned char *str;
4889 unsigned char *begp_orig, *begp, *endp_orig, *endp, c;
4890 int eol_conversion;
4891 Lisp_Object translation_table;
4893 if (coding->type == coding_type_ccl
4894 || coding->type == coding_type_undecided
4895 || coding->eol_type != CODING_EOL_LF
4896 || !NILP (coding->post_read_conversion)
4897 || coding->composing != COMPOSITION_DISABLED)
4899 /* We can't skip any data. */
4900 return;
4902 if (coding->type == coding_type_no_conversion
4903 || coding->type == coding_type_raw_text
4904 || coding->type == coding_type_emacs_mule)
      /* We need no conversion, and there is no need to skip any data
         here: the decoding routine handles such text efficiently
         anyway.  */
4908 return;
4911 translation_table = coding->translation_table_for_decode;
4912 if (NILP (translation_table) && !NILP (Venable_character_translation))
4913 translation_table = Vstandard_translation_table_for_decode;
4914 if (CHAR_TABLE_P (translation_table))
4916 int i;
4917 for (i = 0; i < 128; i++)
4918 if (!NILP (CHAR_TABLE_REF (translation_table, i)))
4919 break;
4920 if (i < 128)
4921 /* Some ASCII character should be translated. We give up
4922 shrinking. */
4923 return;
4926 if (coding->heading_ascii >= 0)
4927 /* Detection routine has already found how much we can skip at the
4928 head. */
4929 *beg += coding->heading_ascii;
4931 if (str)
4933 begp_orig = begp = str + *beg;
4934 endp_orig = endp = str + *end;
4936 else
4938 begp_orig = begp = BYTE_POS_ADDR (*beg);
4939 endp_orig = endp = begp + *end - *beg;
4942 eol_conversion = (coding->eol_type == CODING_EOL_CR
4943 || coding->eol_type == CODING_EOL_CRLF);
4945 switch (coding->type)
4947 case coding_type_sjis:
4948 case coding_type_big5:
4949 /* We can skip all ASCII characters at the head. */
4950 if (coding->heading_ascii < 0)
4952 if (eol_conversion)
4953 while (begp < endp && *begp < 0x80 && *begp != '\r') begp++;
4954 else
4955 while (begp < endp && *begp < 0x80) begp++;
4957 /* We can skip all ASCII characters at the tail except for the
4958 second byte of SJIS or BIG5 code. */
4959 if (eol_conversion)
4960 while (begp < endp && endp[-1] < 0x80 && endp[-1] != '\r') endp--;
4961 else
4962 while (begp < endp && endp[-1] < 0x80) endp--;
4963 /* Do not consider LF as ascii if preceded by CR, since that
4964 confuses eol decoding. */
4965 if (begp < endp && endp < endp_orig && endp[-1] == '\r' && endp[0] == '\n')
4966 endp++;
4967 if (begp < endp && endp < endp_orig && endp[-1] >= 0x80)
4968 endp++;
4969 break;
4971 case coding_type_iso2022:
4972 if (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, 0) != CHARSET_ASCII)
4973 /* We can't skip any data. */
4974 break;
4975 if (coding->heading_ascii < 0)
4977 /* We can skip all ASCII characters at the head except for a
4978 few control codes. */
4979 while (begp < endp && (c = *begp) < 0x80
4980 && c != ISO_CODE_CR && c != ISO_CODE_SO
4981 && c != ISO_CODE_SI && c != ISO_CODE_ESC
4982 && (!eol_conversion || c != ISO_CODE_LF))
4983 begp++;
4985 switch (coding->category_idx)
4987 case CODING_CATEGORY_IDX_ISO_8_1:
4988 case CODING_CATEGORY_IDX_ISO_8_2:
4989 /* We can skip all ASCII characters at the tail. */
4990 if (eol_conversion)
4991 while (begp < endp && (c = endp[-1]) < 0x80 && c != '\r') endp--;
4992 else
4993 while (begp < endp && endp[-1] < 0x80) endp--;
4994 /* Do not consider LF as ascii if preceded by CR, since that
4995 confuses eol decoding. */
4996 if (begp < endp && endp < endp_orig && endp[-1] == '\r' && endp[0] == '\n')
4997 endp++;
4998 break;
5000 case CODING_CATEGORY_IDX_ISO_7:
5001 case CODING_CATEGORY_IDX_ISO_7_TIGHT:
            /* We can skip all characters at the tail except for 8-bit
               codes, and except for ESC together with the 2 bytes
               that follow it.  */
5005 unsigned char *eight_bit = NULL;
5007 if (eol_conversion)
5008 while (begp < endp
5009 && (c = endp[-1]) != ISO_CODE_ESC && c != '\r')
5011 if (!eight_bit && c & 0x80) eight_bit = endp;
5012 endp--;
5014 else
5015 while (begp < endp
5016 && (c = endp[-1]) != ISO_CODE_ESC)
5018 if (!eight_bit && c & 0x80) eight_bit = endp;
5019 endp--;
5021 /* Do not consider LF as ascii if preceded by CR, since that
5022 confuses eol decoding. */
5023 if (begp < endp && endp < endp_orig
5024 && endp[-1] == '\r' && endp[0] == '\n')
5025 endp++;
5026 if (begp < endp && endp[-1] == ISO_CODE_ESC)
                if (endp + 1 < endp_orig && endp[0] == '(' && endp[1] == 'B')
5029 /* This is an ASCII designation sequence. We can
5030 surely skip the tail. But, if we have
5031 encountered an 8-bit code, skip only the codes
5032 after that. */
5033 endp = eight_bit ? eight_bit : endp + 2;
5034 else
5035 /* Hmmm, we can't skip the tail. */
5036 endp = endp_orig;
5038 else if (eight_bit)
5039 endp = eight_bit;
5042 break;
5044 default:
5045 abort ();
5047 *beg += begp - begp_orig;
5048 *end += endp - endp_orig;
5049 return;
5052 /* Like shrink_decoding_region but for encoding. */
5054 static void
5055 shrink_encoding_region (beg, end, coding, str)
5056 int *beg, *end;
5057 struct coding_system *coding;
5058 unsigned char *str;
5060 unsigned char *begp_orig, *begp, *endp_orig, *endp;
5061 int eol_conversion;
5062 Lisp_Object translation_table;
5064 if (coding->type == coding_type_ccl
5065 || coding->eol_type == CODING_EOL_CRLF
5066 || coding->eol_type == CODING_EOL_CR
5067 || (coding->cmp_data && coding->cmp_data->used > 0))
5069 /* We can't skip any data. */
5070 return;
5072 if (coding->type == coding_type_no_conversion
5073 || coding->type == coding_type_raw_text
5074 || coding->type == coding_type_emacs_mule
5075 || coding->type == coding_type_undecided)
      /* We need no conversion, and there is no need to skip any data
         here: the encoding routine handles such text efficiently
         anyway.  */
5079 return;
5082 translation_table = coding->translation_table_for_encode;
5083 if (NILP (translation_table) && !NILP (Venable_character_translation))
5084 translation_table = Vstandard_translation_table_for_encode;
5085 if (CHAR_TABLE_P (translation_table))
5087 int i;
5088 for (i = 0; i < 128; i++)
5089 if (!NILP (CHAR_TABLE_REF (translation_table, i)))
5090 break;
5091 if (i < 128)
5092 /* Some ASCII character should be translated. We give up
5093 shrinking. */
5094 return;
5097 if (str)
5099 begp_orig = begp = str + *beg;
5100 endp_orig = endp = str + *end;
5102 else
5104 begp_orig = begp = BYTE_POS_ADDR (*beg);
5105 endp_orig = endp = begp + *end - *beg;
5108 eol_conversion = (coding->eol_type == CODING_EOL_CR
5109 || coding->eol_type == CODING_EOL_CRLF);
5111 /* Here, we don't have to check coding->pre_write_conversion because
5112 the caller is expected to have handled it already. */
5113 switch (coding->type)
5115 case coding_type_iso2022:
5116 if (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding, 0) != CHARSET_ASCII)
5117 /* We can't skip any data. */
5118 break;
5119 if (coding->flags & CODING_FLAG_ISO_DESIGNATE_AT_BOL)
5121 unsigned char *bol = begp;
5122 while (begp < endp && *begp < 0x80)
5124 begp++;
5125 if (begp[-1] == '\n')
5126 bol = begp;
5128 begp = bol;
5129 goto label_skip_tail;
      /* fall through ... */
5133 case coding_type_sjis:
5134 case coding_type_big5:
5135 /* We can skip all ASCII characters at the head and tail. */
5136 if (eol_conversion)
5137 while (begp < endp && *begp < 0x80 && *begp != '\n') begp++;
5138 else
5139 while (begp < endp && *begp < 0x80) begp++;
5140 label_skip_tail:
5141 if (eol_conversion)
5142 while (begp < endp && endp[-1] < 0x80 && endp[-1] != '\n') endp--;
5143 else
5144 while (begp < endp && *(endp - 1) < 0x80) endp--;
5145 break;
5147 default:
5148 abort ();
5151 *beg += begp - begp_orig;
5152 *end += endp - endp_orig;
5153 return;
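
/* Illustration only, not part of coding.c: a sketch of the basic
   shrinking idea shared by shrink_decoding_region and
   shrink_encoding_region for coding systems in which ASCII bytes pass
   through unchanged.  It narrows [*BEG, *END) past the ASCII bytes at
   both ends so that only the middle needs converting.
   sketch_shrink_ascii is a hypothetical name.  The sketch ignores the
   extra conditions the real functions check, such as end-of-line
   conversion, translation tables, ISO2022 escape sequences, and the
   second byte of an SJIS or BIG5 code at the tail.  */

```c
#include <assert.h>
#include <stddef.h>

/* Narrow [*BEG, *END), given as indices into TEXT, by skipping bytes
   below 0x80 at the head and then at the tail.  */
void
sketch_shrink_ascii (const unsigned char *text, size_t *beg, size_t *end)
{
  assert (text != NULL && beg != NULL && end != NULL && *beg <= *end);
  while (*beg < *end && text[*beg] < 0x80)
    (*beg)++;
  while (*beg < *end && text[*end - 1] < 0x80)
    (*end)--;
}
```

/* If the text is entirely ASCII, the region collapses to an empty
   range with *BEG equal to *END, and nothing is left to convert.  */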
5156 /* As shrinking conversion region requires some overhead, we don't try
5157 shrinking if the length of conversion region is less than this
5158 value. */
5159 static int shrink_conversion_region_threshhold = 1024;
5161 #define SHRINK_CONVERSION_REGION(beg, end, coding, str, encodep) \
5162 do { \
5163 if (*(end) - *(beg) > shrink_conversion_region_threshhold) \
5165 if (encodep) shrink_encoding_region (beg, end, coding, str); \
5166 else shrink_decoding_region (beg, end, coding, str); \
5168 } while (0)
5170 static Lisp_Object
5171 code_convert_region_unwind (dummy)
5172 Lisp_Object dummy;
5174 inhibit_pre_post_conversion = 0;
5175 return Qnil;
/* Store information about all compositions in the range FROM to TO
   of OBJ in memory blocks pointed to by CODING->cmp_data.  OBJ is a
   buffer or a string, and defaults to the current buffer.  */
5182 void
5183 coding_save_composition (coding, from, to, obj)
5184 struct coding_system *coding;
5185 int from, to;
5186 Lisp_Object obj;
5188 Lisp_Object prop;
5189 int start, end;
5191 if (coding->composing == COMPOSITION_DISABLED)
5192 return;
5193 if (!coding->cmp_data)
5194 coding_allocate_composition_data (coding, from);
5195 if (!find_composition (from, to, &start, &end, &prop, obj)
5196 || end > to)
5197 return;
5198 if (start < from
5199 && (!find_composition (end, to, &start, &end, &prop, obj)
5200 || end > to))
5201 return;
5202 coding->composing = COMPOSITION_NO;
5205 if (COMPOSITION_VALID_P (start, end, prop))
5207 enum composition_method method = COMPOSITION_METHOD (prop);
5208 if (coding->cmp_data->used + COMPOSITION_DATA_MAX_BUNCH_LENGTH
5209 >= COMPOSITION_DATA_SIZE)
5210 coding_allocate_composition_data (coding, from);
5211 /* For relative composition, we remember start and end
5212 positions, for the other compositions, we also remember
5213 components. */
5214 CODING_ADD_COMPOSITION_START (coding, start - from, method);
5215 if (method != COMPOSITION_RELATIVE)
5217 /* We must store a*/
5218 Lisp_Object val, ch;
5220 val = COMPOSITION_COMPONENTS (prop);
5221 if (CONSP (val))
5222 while (CONSP (val))
5224 ch = XCAR (val), val = XCDR (val);
5225 CODING_ADD_COMPOSITION_COMPONENT (coding, XINT (ch));
5227 else if (VECTORP (val) || STRINGP (val))
5229 int len = (VECTORP (val)
5230 ? XVECTOR (val)->size : SCHARS (val));
5231 int i;
5232 for (i = 0; i < len; i++)
5234 ch = (STRINGP (val)
5235 ? Faref (val, make_number (i))
5236 : XVECTOR (val)->contents[i]);
5237 CODING_ADD_COMPOSITION_COMPONENT (coding, XINT (ch));
5240 else /* INTEGERP (val) */
5241 CODING_ADD_COMPOSITION_COMPONENT (coding, XINT (val));
5243 CODING_ADD_COMPOSITION_END (coding, end - from);
5245 start = end;
5247 while (start < to
5248 && find_composition (start, to, &start, &end, &prop, obj)
5249 && end <= to);
5251 /* Make coding->cmp_data point to the first memory block. */
5252 while (coding->cmp_data->prev)
5253 coding->cmp_data = coding->cmp_data->prev;
5254 coding->cmp_data_start = 0;
/* Reflect the saved information about compositions to OBJ.
   CODING->cmp_data points to a memory block for the information.  OBJ
   is a buffer or a string, defaults to the current buffer.  */

void
coding_restore_composition (coding, obj)
     struct coding_system *coding;
     Lisp_Object obj;
{
  struct composition_data *cmp_data = coding->cmp_data;

  if (!cmp_data)
    return;

  while (cmp_data->prev)
    cmp_data = cmp_data->prev;

  while (cmp_data)
    {
      int i;

      for (i = 0; i < cmp_data->used && cmp_data->data[i] > 0;
           i += cmp_data->data[i])
        {
          int *data = cmp_data->data + i;
          enum composition_method method = (enum composition_method) data[3];
          Lisp_Object components;

          if (method == COMPOSITION_RELATIVE)
            components = Qnil;
          else
            {
              int len = data[0] - 4, j;
              Lisp_Object args[MAX_COMPOSITION_COMPONENTS * 2 - 1];

              for (j = 0; j < len; j++)
                args[j] = make_number (data[4 + j]);
              components = (method == COMPOSITION_WITH_ALTCHARS
                            ? Fstring (len, args) : Fvector (len, args));
            }
          compose_text (data[1], data[2], components, Qnil, obj);
        }
      cmp_data = cmp_data->next;
    }
}
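
/* A minimal sketch of the record layout that the loop in
   coding_restore_composition appears to walk, inferred only from the
   indexing in that function: data[0] is the record length in ints
   (including a four-int header), data[1] and data[2] are the start
   and end positions, data[3] is the composition method, and any
   remaining ints are component codes.  The function name
   count_composition_records is hypothetical, and the method values
   in the usage note are placeholders.  */

```c
#include <assert.h>

/* Walk the records packed in DATA[0..USED) and return how many there
   are.  The total number of component codes is stored in
   *N_COMPONENTS.  A non-positive length word ends the walk.  */
static int
count_composition_records (const int *data, int used, int *n_components)
{
  int i, n = 0;

  *n_components = 0;
  for (i = 0; i < used && data[i] > 0; i += data[i])
    {
      /* Every record carries at least the four-int header.  */
      assert (data[i] >= 4);
      *n_components += data[i] - 4;
      n++;
    }
  return n;
}
```

/* For example, the array { 4, 0, 2, 0,  6, 5, 7, 1, 0x61, 0x62,  0 }
   holds one header-only record followed by one record with two
   components.  */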
/* Decode (if ENCODEP is zero) or encode (if ENCODEP is nonzero) the
   text from FROM to TO (byte positions are FROM_BYTE and TO_BYTE) by
   coding system CODING, and return the status code of code conversion
   (currently, this value has no meaning).

   The numbers of characters and bytes consumed and produced by the
   conversion are recorded in members of the structure CODING.

   If REPLACE is nonzero, we do various things as if the original text
   is deleted and a new text is inserted.  See the comments in
   replace_range (insdel.c) to know what we are doing.

   If REPLACE is zero, it is assumed that the source text is unibyte.
   Otherwise, it is assumed that the source text is multibyte.  */
5320 code_convert_region (from, from_byte, to, to_byte, coding, encodep, replace)
5321 int from, from_byte, to, to_byte, encodep, replace;
5322 struct coding_system *coding;
5324 int len = to - from, len_byte = to_byte - from_byte;
5325 int nchars_del = 0, nbytes_del = 0;
5326 int require, inserted, inserted_byte;
5327 int head_skip, tail_skip, total_skip = 0;
5328 Lisp_Object saved_coding_symbol;
5329 int first = 1;
5330 unsigned char *src, *dst;
5331 Lisp_Object deletion;
5332 int orig_point = PT, orig_len = len;
5333 int prev_Z;
5334 int multibyte_p = !NILP (current_buffer->enable_multibyte_characters);
5336 deletion = Qnil;
5337 saved_coding_symbol = coding->symbol;
5339 if (from < PT && PT < to)
5341 TEMP_SET_PT_BOTH (from, from_byte);
5342 orig_point = from;
5345 if (replace)
5347 int saved_from = from;
5348 int saved_inhibit_modification_hooks;
5350 prepare_to_modify_buffer (from, to, &from);
5351 if (saved_from != from)
5353 to = from + len;
5354 from_byte = CHAR_TO_BYTE (from), to_byte = CHAR_TO_BYTE (to);
5355 len_byte = to_byte - from_byte;
5358 /* The code conversion routine can not preserve text properties
5359 for now. So, we must remove all text properties in the
5360 region. Here, we must suppress all modification hooks. */
5361 saved_inhibit_modification_hooks = inhibit_modification_hooks;
5362 inhibit_modification_hooks = 1;
5363 Fset_text_properties (make_number (from), make_number (to), Qnil, Qnil);
5364 inhibit_modification_hooks = saved_inhibit_modification_hooks;
5367 if (! encodep && CODING_REQUIRE_DETECTION (coding))
5369 /* We must detect encoding of text and eol format. */
5371 if (from < GPT && to > GPT)
5372 move_gap_both (from, from_byte);
5373 if (coding->type == coding_type_undecided)
5375 detect_coding (coding, BYTE_POS_ADDR (from_byte), len_byte);
5376 if (coding->type == coding_type_undecided)
5378 /* It seems that the text contains only ASCII, but we
5379 should not leave it undecided because the deeper
5380 decoding routine (decode_coding) tries to detect the
5381 encodings again in vain. */
5382 coding->type = coding_type_emacs_mule;
5383 coding->category_idx = CODING_CATEGORY_IDX_EMACS_MULE;
5384 /* As emacs-mule decoder will handle composition, we
5385 need this setting to allocate coding->cmp_data
5386 later. */
5387 coding->composing = COMPOSITION_NO;
5390 if (coding->eol_type == CODING_EOL_UNDECIDED
5391 && coding->type != coding_type_ccl)
5393 detect_eol (coding, BYTE_POS_ADDR (from_byte), len_byte);
5394 if (coding->eol_type == CODING_EOL_UNDECIDED)
5395 coding->eol_type = CODING_EOL_LF;
5396 /* We had better recover the original eol format if we
5397 encounter an inconsistent eol format while decoding. */
5398 coding->mode |= CODING_MODE_INHIBIT_INCONSISTENT_EOL;
5402 /* Now we convert the text. */
5404 /* For encoding, we must process pre-write-conversion in advance. */
5405 if (! inhibit_pre_post_conversion
5406 && encodep
5407 && SYMBOLP (coding->pre_write_conversion)
5408 && ! NILP (Ffboundp (coding->pre_write_conversion)))
5410 /* The function in pre-write-conversion may put a new text in a
5411 new buffer. */
5412 struct buffer *prev = current_buffer;
5413 Lisp_Object new;
5415 record_unwind_protect (code_convert_region_unwind, Qnil);
5416 /* We should not call any more pre-write/post-read-conversion
5417 functions while this pre-write-conversion is running. */
5418 inhibit_pre_post_conversion = 1;
5419 call2 (coding->pre_write_conversion,
5420 make_number (from), make_number (to));
5421 inhibit_pre_post_conversion = 0;
5422 /* Discard the unwind protect. */
5423 specpdl_ptr--;
5425 if (current_buffer != prev)
5427 len = ZV - BEGV;
5428 new = Fcurrent_buffer ();
5429 set_buffer_internal_1 (prev);
5430 del_range_2 (from, from_byte, to, to_byte, 0);
5431 TEMP_SET_PT_BOTH (from, from_byte);
5432 insert_from_buffer (XBUFFER (new), 1, len, 0);
5433 Fkill_buffer (new);
5434 if (orig_point >= to)
5435 orig_point += len - orig_len;
5436 else if (orig_point > from)
5437 orig_point = from;
5438 orig_len = len;
5439 to = from + len;
5440 from_byte = CHAR_TO_BYTE (from);
5441 to_byte = CHAR_TO_BYTE (to);
5442 len_byte = to_byte - from_byte;
5443 TEMP_SET_PT_BOTH (from, from_byte);
5447 if (replace)
5449 if (! EQ (current_buffer->undo_list, Qt))
5450 deletion = make_buffer_string_both (from, from_byte, to, to_byte, 1);
5451 else
5453 nchars_del = to - from;
5454 nbytes_del = to_byte - from_byte;
5458 if (coding->composing != COMPOSITION_DISABLED)
5460 if (encodep)
5461 coding_save_composition (coding, from, to, Fcurrent_buffer ());
5462 else
5463 coding_allocate_composition_data (coding, from);
  /* Try to skip the leading and trailing ASCII bytes.  */
5467 if (coding->type != coding_type_ccl)
5469 int from_byte_orig = from_byte, to_byte_orig = to_byte;
5471 if (from < GPT && GPT < to)
5472 move_gap_both (from, from_byte);
5473 SHRINK_CONVERSION_REGION (&from_byte, &to_byte, coding, NULL, encodep);
5474 if (from_byte == to_byte
5475 && (encodep || NILP (coding->post_read_conversion))
5476 && ! CODING_REQUIRE_FLUSHING (coding))
5478 coding->produced = len_byte;
5479 coding->produced_char = len;
5480 if (!replace)
5481 /* We must record and adjust for this new text now. */
5482 adjust_after_insert (from, from_byte_orig, to, to_byte_orig, len);
5483 return 0;
5486 head_skip = from_byte - from_byte_orig;
5487 tail_skip = to_byte_orig - to_byte;
5488 total_skip = head_skip + tail_skip;
5489 from += head_skip;
5490 to -= tail_skip;
5491 len -= total_skip; len_byte -= total_skip;
  /* For conversion, we must put the gap before the text in addition to
     making the gap larger for efficient decoding.  The required gap
     size starts from 2000, which is the magic number used in make_gap.
     But, after one batch of conversion, it will be incremented if we
     find that it is not enough.  */
5499 require = 2000;
5501 if (GAP_SIZE < require)
5502 make_gap (require - GAP_SIZE);
5503 move_gap_both (from, from_byte);
5505 inserted = inserted_byte = 0;
5507 GAP_SIZE += len_byte;
5508 ZV -= len;
5509 Z -= len;
5510 ZV_BYTE -= len_byte;
5511 Z_BYTE -= len_byte;
5513 if (GPT - BEG < BEG_UNCHANGED)
5514 BEG_UNCHANGED = GPT - BEG;
5515 if (Z - GPT < END_UNCHANGED)
5516 END_UNCHANGED = Z - GPT;
5518 if (!encodep && coding->src_multibyte)
      /* Decoding routines expect that the source text is unibyte.
         We must convert 8-bit characters of multibyte form to
         unibyte.  */
5523 int len_byte_orig = len_byte;
5524 len_byte = str_as_unibyte (GAP_END_ADDR - len_byte, len_byte);
5525 if (len_byte < len_byte_orig)
5526 safe_bcopy (GAP_END_ADDR - len_byte_orig, GAP_END_ADDR - len_byte,
5527 len_byte);
5528 coding->src_multibyte = 0;
5531 for (;;)
5533 int result;
5535 /* The buffer memory is now:
5536 +--------+converted-text+---------+-------original-text-------+---+
5537 |<-from->|<--inserted-->|---------|<--------len_byte--------->|---|
5538 |<---------------------- GAP ----------------------->| */
5539 src = GAP_END_ADDR - len_byte;
5540 dst = GPT_ADDR + inserted_byte;
5542 if (encodep)
5543 result = encode_coding (coding, src, dst, len_byte, 0);
5544 else
5546 if (coding->composing != COMPOSITION_DISABLED)
5547 coding->cmp_data->char_offset = from + inserted;
5548 result = decode_coding (coding, src, dst, len_byte, 0);
5551 /* The buffer memory is now:
5552 +--------+-------converted-text----+--+------original-text----+---+
5553 |<-from->|<-inserted->|<-produced->|--|<-(len_byte-consumed)->|---|
5554 |<---------------------- GAP ----------------------->| */
5556 inserted += coding->produced_char;
5557 inserted_byte += coding->produced;
5558 len_byte -= coding->consumed;
5560 if (result == CODING_FINISH_INSUFFICIENT_CMP)
5562 coding_allocate_composition_data (coding, from + inserted);
5563 continue;
5566 src += coding->consumed;
5567 dst += coding->produced;
5569 if (result == CODING_FINISH_NORMAL)
5571 src += len_byte;
5572 break;
5574 if (! encodep && result == CODING_FINISH_INCONSISTENT_EOL)
5576 unsigned char *pend = dst, *p = pend - inserted_byte;
5577 Lisp_Object eol_type;
5579 /* Encode LFs back to the original eol format (CR or CRLF). */
5580 if (coding->eol_type == CODING_EOL_CR)
5582 while (p < pend) if (*p++ == '\n') p[-1] = '\r';
5584 else
5586 int count = 0;
5588 while (p < pend) if (*p++ == '\n') count++;
5589 if (src - dst < count)
5591 /* We don't have sufficient room for encoding LFs
5592 back to CRLF. We must record converted and
5593 not-yet-converted text back to the buffer
5594 content, enlarge the gap, then record them out of
5595 the buffer contents again. */
5596 int add = len_byte + inserted_byte;
5598 GAP_SIZE -= add;
5599 ZV += add; Z += add; ZV_BYTE += add; Z_BYTE += add;
5600 GPT += inserted_byte; GPT_BYTE += inserted_byte;
5601 make_gap (count - GAP_SIZE);
5602 GAP_SIZE += add;
5603 ZV -= add; Z -= add; ZV_BYTE -= add; Z_BYTE -= add;
5604 GPT -= inserted_byte; GPT_BYTE -= inserted_byte;
5605 /* Don't forget to update SRC, DST, and PEND. */
5606 src = GAP_END_ADDR - len_byte;
5607 dst = GPT_ADDR + inserted_byte;
5608 pend = dst;
5610 inserted += count;
5611 inserted_byte += count;
5612 coding->produced += count;
5613 p = dst = pend + count;
5614 while (count)
5616 *--p = *--pend;
5617 if (*p == '\n') count--, *--p = '\r';
5621 /* Suppress eol-format conversion in the further conversion. */
5622 coding->eol_type = CODING_EOL_LF;
5624 /* Set the coding system symbol to that for Unix-like EOL. */
5625 eol_type = Fget (saved_coding_symbol, Qeol_type);
5626 if (VECTORP (eol_type)
5627 && XVECTOR (eol_type)->size == 3
5628 && SYMBOLP (XVECTOR (eol_type)->contents[CODING_EOL_LF]))
5629 coding->symbol = XVECTOR (eol_type)->contents[CODING_EOL_LF];
5630 else
5631 coding->symbol = saved_coding_symbol;
5633 continue;
5635 if (len_byte <= 0)
5637 if (coding->type != coding_type_ccl
5638 || coding->mode & CODING_MODE_LAST_BLOCK)
5639 break;
5640 coding->mode |= CODING_MODE_LAST_BLOCK;
5641 continue;
5643 if (result == CODING_FINISH_INSUFFICIENT_SRC)
5645 /* The source text ends in invalid codes. Let's just
5646 make them valid buffer contents, and finish conversion. */
5647 if (multibyte_p)
5649 unsigned char *start = dst;
5651 inserted += len_byte;
5652 while (len_byte--)
5654 int c = *src++;
5655 dst += CHAR_STRING (c, dst);
5658 inserted_byte += dst - start;
5660 else
5662 inserted += len_byte;
5663 inserted_byte += len_byte;
5664 while (len_byte--)
5665 *dst++ = *src++;
5667 break;
5669 if (result == CODING_FINISH_INTERRUPT)
5671 /* The conversion procedure was interrupted by a user. */
5672 break;
5674 /* Now RESULT == CODING_FINISH_INSUFFICIENT_DST */
5675 if (coding->consumed < 1)
5677 /* It's quite strange to require more memory without
5678 consuming any bytes. Perhaps CCL program bug. */
5679 break;
      if (first)
        {
          /* We have just done the first batch of conversion, which was
             stopped because of insufficient gap.  Let's reconsider the
             required gap size (i.e. SRC - DST) now.

             We have converted ORIG bytes (== coding->consumed) into
             NEW bytes (coding->produced).  To convert the remaining
             LEN_BYTE bytes, we may need REQUIRE bytes of gap, where:
                REQUIRE + LEN_BYTE = LEN_BYTE * (NEW / ORIG)
                REQUIRE = LEN_BYTE * (NEW - ORIG) / ORIG
             Here, we are sure that NEW >= ORIG.  */
          float ratio;

          if (coding->produced <= coding->consumed)
            {
              /* This happens because of CCL-based coding system with
                 eol-type CRLF.  */
              require = 0;
            }
          else
            {
              /* Cast before dividing; integer division would truncate
                 the ratio toward zero.  */
              ratio = ((float) (coding->produced - coding->consumed)
                       / coding->consumed);
              require = len_byte * ratio;
            }
          first = 0;
        }
5708 if ((src - dst) < (require + 2000))
5710 /* See the comment above the previous call of make_gap. */
5711 int add = len_byte + inserted_byte;
5713 GAP_SIZE -= add;
5714 ZV += add; Z += add; ZV_BYTE += add; Z_BYTE += add;
5715 GPT += inserted_byte; GPT_BYTE += inserted_byte;
5716 make_gap (require + 2000);
5717 GAP_SIZE += add;
5718 ZV -= add; Z -= add; ZV_BYTE -= add; Z_BYTE -= add;
5719 GPT -= inserted_byte; GPT_BYTE -= inserted_byte;
5722 if (src - dst > 0) *dst = 0; /* Put an anchor. */
5724 if (encodep && coding->dst_multibyte)
5726 /* The output is unibyte. We must convert 8-bit characters to
5727 multibyte form. */
5728 if (inserted_byte * 2 > GAP_SIZE)
5730 GAP_SIZE -= inserted_byte;
5731 ZV += inserted_byte; Z += inserted_byte;
5732 ZV_BYTE += inserted_byte; Z_BYTE += inserted_byte;
5733 GPT += inserted_byte; GPT_BYTE += inserted_byte;
5734 make_gap (inserted_byte - GAP_SIZE);
5735 GAP_SIZE += inserted_byte;
5736 ZV -= inserted_byte; Z -= inserted_byte;
5737 ZV_BYTE -= inserted_byte; Z_BYTE -= inserted_byte;
5738 GPT -= inserted_byte; GPT_BYTE -= inserted_byte;
5740 inserted_byte = str_to_multibyte (GPT_ADDR, GAP_SIZE, inserted_byte);
5743 /* If we shrank the conversion area, adjust it now. */
5744 if (total_skip > 0)
5746 if (tail_skip > 0)
5747 safe_bcopy (GAP_END_ADDR, GPT_ADDR + inserted_byte, tail_skip);
5748 inserted += total_skip; inserted_byte += total_skip;
5749 GAP_SIZE += total_skip;
5750 GPT -= head_skip; GPT_BYTE -= head_skip;
5751 ZV -= total_skip; ZV_BYTE -= total_skip;
5752 Z -= total_skip; Z_BYTE -= total_skip;
5753 from -= head_skip; from_byte -= head_skip;
5754 to += tail_skip; to_byte += tail_skip;
5757 prev_Z = Z;
5758 if (! EQ (current_buffer->undo_list, Qt))
5759 adjust_after_replace (from, from_byte, deletion, inserted, inserted_byte);
5760 else
5761 adjust_after_replace_noundo (from, from_byte, nchars_del, nbytes_del,
5762 inserted, inserted_byte);
5763 inserted = Z - prev_Z;
5765 if (!encodep && coding->cmp_data && coding->cmp_data->used)
5766 coding_restore_composition (coding, Fcurrent_buffer ());
5767 coding_free_composition_data (coding);
5769 if (! inhibit_pre_post_conversion
5770 && ! encodep && ! NILP (coding->post_read_conversion))
5772 Lisp_Object val;
5774 if (from != PT)
5775 TEMP_SET_PT_BOTH (from, from_byte);
5776 prev_Z = Z;
5777 record_unwind_protect (code_convert_region_unwind, Qnil);
5778 /* We should not call any more pre-write/post-read-conversion
5779 functions while this post-read-conversion is running. */
5780 inhibit_pre_post_conversion = 1;
5781 val = call1 (coding->post_read_conversion, make_number (inserted));
5782 inhibit_pre_post_conversion = 0;
5783 /* Discard the unwind protect. */
5784 specpdl_ptr--;
5785 CHECK_NUMBER (val);
5786 inserted += Z - prev_Z;
5789 if (orig_point >= from)
5791 if (orig_point >= from + orig_len)
5792 orig_point += inserted - orig_len;
5793 else
5794 orig_point = from;
5795 TEMP_SET_PT (orig_point);
5798 if (replace)
5800 signal_after_change (from, to - from, inserted);
5801 update_compositions (from, from + inserted, CHECK_BORDER);
5805 coding->consumed = to_byte - from_byte;
5806 coding->consumed_char = to - from;
5807 coding->produced = inserted_byte;
5808 coding->produced_char = inserted;
5811 return 0;
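
/* A minimal sketch of the gap estimate described in the comment inside
   code_convert_region, REQUIRE = LEN_BYTE * (NEW - ORIG) / ORIG.  The
   function name estimate_extra_gap and its parameter names are
   hypothetical.  The sketch does the arithmetic in floating point
   throughout, since an integer quotient (NEW - ORIG) / ORIG would be
   zero whenever the text grows by less than a factor of two.  */

```c
#include <assert.h>

/* Given that CONSUMED source bytes expanded into PRODUCED bytes in the
   first batch, estimate how many extra bytes the REMAINING source
   bytes will need.  Returns 0 when the text did not grow.  */
static int
estimate_extra_gap (int consumed, int produced, int remaining)
{
  assert (consumed > 0 && produced >= 0 && remaining >= 0);
  if (produced <= consumed)
    return 0;
  return (int) ((double) remaining * (produced - consumed) / consumed);
}
```

/* For instance, if 100 bytes became 150, the remaining 1000 bytes are
   expected to need about 500 bytes of additional room.  */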
5814 Lisp_Object
5815 run_pre_post_conversion_on_str (str, coding, encodep)
5816 Lisp_Object str;
5817 struct coding_system *coding;
5818 int encodep;
5820 int count = SPECPDL_INDEX ();
5821 struct gcpro gcpro1, gcpro2;
5822 int multibyte = STRING_MULTIBYTE (str);
5823 Lisp_Object buffer;
5824 struct buffer *buf;
5825 Lisp_Object old_deactivate_mark;
5827 record_unwind_protect (Fset_buffer, Fcurrent_buffer ());
5828 record_unwind_protect (code_convert_region_unwind, Qnil);
5829 /* It is not crucial to specbind this. */
5830 old_deactivate_mark = Vdeactivate_mark;
5831 GCPRO2 (str, old_deactivate_mark);
5833 buffer = Fget_buffer_create (build_string (" *code-converting-work*"));
5834 buf = XBUFFER (buffer);
5836 buf->directory = current_buffer->directory;
5837 buf->read_only = Qnil;
5838 buf->filename = Qnil;
5839 buf->undo_list = Qt;
5840 buf->overlays_before = Qnil;
5841 buf->overlays_after = Qnil;
5843 set_buffer_internal (buf);
5844 /* We must insert the contents of STR as is without
5845 unibyte<->multibyte conversion. For that, we adjust the
5846 multibyteness of the working buffer to that of STR. */
5847 Ferase_buffer ();
5848 buf->enable_multibyte_characters = multibyte ? Qt : Qnil;
5850 insert_from_string (str, 0, 0,
5851 SCHARS (str), SBYTES (str), 0);
5852 UNGCPRO;
5853 inhibit_pre_post_conversion = 1;
5854 if (encodep)
5855 call2 (coding->pre_write_conversion, make_number (BEG), make_number (Z));
5856 else
5858 TEMP_SET_PT_BOTH (BEG, BEG_BYTE);
5859 call1 (coding->post_read_conversion, make_number (Z - BEG));
5861 inhibit_pre_post_conversion = 0;
5862 Vdeactivate_mark = old_deactivate_mark;
5863 str = make_buffer_string (BEG, Z, 1);
5864 return unbind_to (count, str);
5867 Lisp_Object
5868 decode_coding_string (str, coding, nocopy)
5869 Lisp_Object str;
5870 struct coding_system *coding;
5871 int nocopy;
5873 int len;
5874 struct conversion_buffer buf;
5875 int from, to_byte;
5876 Lisp_Object saved_coding_symbol;
5877 int result;
5878 int require_decoding;
5879 int shrinked_bytes = 0;
5880 Lisp_Object newstr;
5881 int consumed, consumed_char, produced, produced_char;
5883 from = 0;
5884 to_byte = SBYTES (str);
5886 saved_coding_symbol = coding->symbol;
5887 coding->src_multibyte = STRING_MULTIBYTE (str);
5888 coding->dst_multibyte = 1;
5889 if (CODING_REQUIRE_DETECTION (coding))
5891 /* See the comments in code_convert_region. */
5892 if (coding->type == coding_type_undecided)
5894 detect_coding (coding, SDATA (str), to_byte);
5895 if (coding->type == coding_type_undecided)
5897 coding->type = coding_type_emacs_mule;
5898 coding->category_idx = CODING_CATEGORY_IDX_EMACS_MULE;
5899 /* As emacs-mule decoder will handle composition, we
5900 need this setting to allocate coding->cmp_data
5901 later. */
5902 coding->composing = COMPOSITION_NO;
5905 if (coding->eol_type == CODING_EOL_UNDECIDED
5906 && coding->type != coding_type_ccl)
5908 saved_coding_symbol = coding->symbol;
5909 detect_eol (coding, SDATA (str), to_byte);
5910 if (coding->eol_type == CODING_EOL_UNDECIDED)
5911 coding->eol_type = CODING_EOL_LF;
5912 /* We had better recover the original eol format if we
5913 encounter an inconsistent eol format while decoding. */
5914 coding->mode |= CODING_MODE_INHIBIT_INCONSISTENT_EOL;
5918 if (coding->type == coding_type_no_conversion
5919 || coding->type == coding_type_raw_text)
5920 coding->dst_multibyte = 0;
5922 require_decoding = CODING_REQUIRE_DECODING (coding);
5924 if (STRING_MULTIBYTE (str))
5926 /* Decoding routines expect the source text to be unibyte. */
5927 str = Fstring_as_unibyte (str);
5928 to_byte = SBYTES (str);
5929 nocopy = 1;
5930 coding->src_multibyte = 0;
  /* Try to skip the leading and trailing ASCII bytes.  */
  if (require_decoding && coding->type != coding_type_ccl)
    {
      SHRINK_CONVERSION_REGION (&from, &to_byte, coding, SDATA (str),
                                0);
      if (from == to_byte)
        require_decoding = 0;
      shrinked_bytes = from + (SBYTES (str) - to_byte);
    }
5943 if (!require_decoding)
5945 coding->consumed = SBYTES (str);
5946 coding->consumed_char = SCHARS (str);
5947 if (coding->dst_multibyte)
5949 str = Fstring_as_multibyte (str);
5950 nocopy = 1;
5952 coding->produced = SBYTES (str);
5953 coding->produced_char = SCHARS (str);
5954 return (nocopy ? str : Fcopy_sequence (str));
5957 if (coding->composing != COMPOSITION_DISABLED)
5958 coding_allocate_composition_data (coding, from);
5959 len = decoding_buffer_size (coding, to_byte - from);
5960 allocate_conversion_buffer (buf, len);
5962 consumed = consumed_char = produced = produced_char = 0;
5963 while (1)
5965 result = decode_coding (coding, SDATA (str) + from + consumed,
5966 buf.data + produced, to_byte - from - consumed,
5967 buf.size - produced);
5968 consumed += coding->consumed;
5969 consumed_char += coding->consumed_char;
5970 produced += coding->produced;
5971 produced_char += coding->produced_char;
5972 if (result == CODING_FINISH_NORMAL
5973 || (result == CODING_FINISH_INSUFFICIENT_SRC
5974 && coding->consumed == 0))
5975 break;
5976 if (result == CODING_FINISH_INSUFFICIENT_CMP)
5977 coding_allocate_composition_data (coding, from + produced_char);
5978 else if (result == CODING_FINISH_INSUFFICIENT_DST)
5979 extend_conversion_buffer (&buf);
5980 else if (result == CODING_FINISH_INCONSISTENT_EOL)
5982 Lisp_Object eol_type;
5984 /* Recover the original EOL format. */
5985 if (coding->eol_type == CODING_EOL_CR)
5987 unsigned char *p;
5988 for (p = buf.data; p < buf.data + produced; p++)
5989 if (*p == '\n') *p = '\r';
5991 else if (coding->eol_type == CODING_EOL_CRLF)
5993 int num_eol = 0;
5994 unsigned char *p0, *p1;
5995 for (p0 = buf.data, p1 = p0 + produced; p0 < p1; p0++)
5996 if (*p0 == '\n') num_eol++;
5997 if (produced + num_eol >= buf.size)
5998 extend_conversion_buffer (&buf);
5999 for (p0 = buf.data + produced, p1 = p0 + num_eol; p0 > buf.data;)
6001 *--p1 = *--p0;
6002 if (*p0 == '\n') *--p1 = '\r';
6004 produced += num_eol;
6005 produced_char += num_eol;
6007 /* Suppress eol-format conversion in the further conversion. */
6008 coding->eol_type = CODING_EOL_LF;
6010 /* Set the coding system symbol to that for Unix-like EOL. */
6011 eol_type = Fget (saved_coding_symbol, Qeol_type);
6012 if (VECTORP (eol_type)
6013 && XVECTOR (eol_type)->size == 3
6014 && SYMBOLP (XVECTOR (eol_type)->contents[CODING_EOL_LF]))
6015 coding->symbol = XVECTOR (eol_type)->contents[CODING_EOL_LF];
6016 else
6017 coding->symbol = saved_coding_symbol;
6023 coding->consumed = consumed;
6024 coding->consumed_char = consumed_char;
6025 coding->produced = produced;
6026 coding->produced_char = produced_char;
6028 if (coding->dst_multibyte)
6029 newstr = make_uninit_multibyte_string (produced_char + shrinked_bytes,
6030 produced + shrinked_bytes);
6031 else
6032 newstr = make_uninit_string (produced + shrinked_bytes);
6033 if (from > 0)
6034 STRING_COPYIN (newstr, 0, SDATA (str), from);
6035 STRING_COPYIN (newstr, from, buf.data, produced);
6036 if (shrinked_bytes > from)
6037 STRING_COPYIN (newstr, from + produced,
6038 SDATA (str) + to_byte,
6039 shrinked_bytes - from);
6040 free_conversion_buffer (&buf);
6042 if (coding->cmp_data && coding->cmp_data->used)
6043 coding_restore_composition (coding, newstr);
6044 coding_free_composition_data (coding);
6046 if (SYMBOLP (coding->post_read_conversion)
6047 && !NILP (Ffboundp (coding->post_read_conversion)))
6048 newstr = run_pre_post_conversion_on_str (newstr, coding, 0);
6050 return newstr;
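
/* A minimal sketch of the back-to-front LF to CRLF expansion that the
   inconsistent-EOL recovery code above performs on its output buffer.
   The function name expand_lf_to_crlf and the (size_t) -1 failure
   value are assumptions of this sketch; the real code grows the
   buffer instead of failing.  */

```c
#include <assert.h>
#include <stddef.h>

/* Expand every '\n' in BUF[0..LEN) to "\r\n" in place.  CAP is the
   capacity of BUF.  Returns the new length, or (size_t) -1 if the
   expanded text would not fit.  */
static size_t
expand_lf_to_crlf (unsigned char *buf, size_t len, size_t cap)
{
  size_t count = 0, i;
  unsigned char *src, *dst;

  for (i = 0; i < len; i++)
    if (buf[i] == '\n')
      count++;
  if (len + count > cap)
    return (size_t) -1;

  src = buf + len;
  dst = src + count;
  /* Copy from the tail so that unread bytes are never overwritten.  */
  while (src > buf)
    {
      *--dst = *--src;
      if (*dst == '\n')
        *--dst = '\r';
    }
  assert (dst == buf);
  return len + count;
}
```

/* Working backwards needs no scratch buffer: the write pointer starts
   COUNT bytes ahead of the read pointer and the two meet at the start
   of the buffer.  */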
6053 Lisp_Object
6054 encode_coding_string (str, coding, nocopy)
6055 Lisp_Object str;
6056 struct coding_system *coding;
6057 int nocopy;
6059 int len;
6060 struct conversion_buffer buf;
6061 int from, to, to_byte;
6062 int result;
6063 int shrinked_bytes = 0;
6064 Lisp_Object newstr;
6065 int consumed, consumed_char, produced, produced_char;
6067 if (SYMBOLP (coding->pre_write_conversion)
6068 && !NILP (Ffboundp (coding->pre_write_conversion)))
6069 str = run_pre_post_conversion_on_str (str, coding, 1);
6071 from = 0;
6072 to = SCHARS (str);
6073 to_byte = SBYTES (str);
6075 /* Encoding routines determine the multibyteness of the source text
6076 by coding->src_multibyte. */
6077 coding->src_multibyte = STRING_MULTIBYTE (str);
6078 coding->dst_multibyte = 0;
6079 if (! CODING_REQUIRE_ENCODING (coding))
6081 coding->consumed = SBYTES (str);
6082 coding->consumed_char = SCHARS (str);
6083 if (STRING_MULTIBYTE (str))
6085 str = Fstring_as_unibyte (str);
6086 nocopy = 1;
6088 coding->produced = SBYTES (str);
6089 coding->produced_char = SCHARS (str);
6090 return (nocopy ? str : Fcopy_sequence (str));
6093 if (coding->composing != COMPOSITION_DISABLED)
6094 coding_save_composition (coding, from, to, str);
  /* Try to skip the leading and trailing ASCII bytes.  */
  if (coding->type != coding_type_ccl)
    {
      SHRINK_CONVERSION_REGION (&from, &to_byte, coding, SDATA (str),
                                1);
      if (from == to_byte)
        return (nocopy ? str : Fcopy_sequence (str));
      shrinked_bytes = from + (SBYTES (str) - to_byte);
    }
6106 len = encoding_buffer_size (coding, to_byte - from);
6107 allocate_conversion_buffer (buf, len);
6109 consumed = consumed_char = produced = produced_char = 0;
6110 while (1)
6112 result = encode_coding (coding, SDATA (str) + from + consumed,
6113 buf.data + produced, to_byte - from - consumed,
6114 buf.size - produced);
6115 consumed += coding->consumed;
6116 consumed_char += coding->consumed_char;
6117 produced += coding->produced;
6118 produced_char += coding->produced_char;
6119 if (result == CODING_FINISH_NORMAL
6120 || (result == CODING_FINISH_INSUFFICIENT_SRC
6121 && coding->consumed == 0))
6122 break;
6123 /* Now result should be CODING_FINISH_INSUFFICIENT_DST. */
6124 extend_conversion_buffer (&buf);
6127 coding->consumed = consumed;
6128 coding->consumed_char = consumed_char;
6129 coding->produced = produced;
6130 coding->produced_char = produced_char;
6132 newstr = make_uninit_string (produced + shrinked_bytes);
6133 if (from > 0)
6134 STRING_COPYIN (newstr, 0, SDATA (str), from);
6135 STRING_COPYIN (newstr, from, buf.data, produced);
6136 if (shrinked_bytes > from)
6137 STRING_COPYIN (newstr, from + produced,
6138 SDATA (str) + to_byte,
6139 shrinked_bytes - from);
6141 free_conversion_buffer (&buf);
6142 coding_free_composition_data (coding);
6144 return newstr;
6148 #ifdef emacs
6149 /*** 8. Emacs Lisp library functions ***/
6151 DEFUN ("coding-system-p", Fcoding_system_p, Scoding_system_p, 1, 1, 0,
6152 doc: /* Return t if OBJECT is nil or a coding-system.
6153 See the documentation of `make-coding-system' for information
6154 about coding-system objects. */)
6155 (obj)
6156 Lisp_Object obj;
6158 if (NILP (obj))
6159 return Qt;
6160 if (!SYMBOLP (obj))
6161 return Qnil;
6162 /* Get coding-spec vector for OBJ. */
6163 obj = Fget (obj, Qcoding_system);
6164 return ((VECTORP (obj) && XVECTOR (obj)->size == 5)
6165 ? Qt : Qnil);
DEFUN ("read-non-nil-coding-system", Fread_non_nil_coding_system,
       Sread_non_nil_coding_system, 1, 1, 0,
       doc: /* Read a coding system from the minibuffer, prompting with string PROMPT.  */)
     (prompt)
     Lisp_Object prompt;
{
  Lisp_Object val;
  do
    {
      val = Fcompleting_read (prompt, Vcoding_system_alist, Qnil,
                              Qt, Qnil, Qcoding_system_history, Qnil, Qnil);
    }
  while (SCHARS (val) == 0);
  return (Fintern (val, Qnil));
}
6184 DEFUN ("read-coding-system", Fread_coding_system, Sread_coding_system, 1, 2, 0,
6185 doc: /* Read a coding system from the minibuffer, prompting with string PROMPT.
6186 If the user enters null input, return second argument DEFAULT-CODING-SYSTEM. */)
6187 (prompt, default_coding_system)
6188 Lisp_Object prompt, default_coding_system;
6190 Lisp_Object val;
6191 if (SYMBOLP (default_coding_system))
6192 default_coding_system = SYMBOL_NAME (default_coding_system);
6193 val = Fcompleting_read (prompt, Vcoding_system_alist, Qnil,
6194 Qt, Qnil, Qcoding_system_history,
6195 default_coding_system, Qnil);
6196 return (SCHARS (val) == 0 ? Qnil : Fintern (val, Qnil));
6199 DEFUN ("check-coding-system", Fcheck_coding_system, Scheck_coding_system,
6200 1, 1, 0,
6201 doc: /* Check validity of CODING-SYSTEM.
6202 If valid, return CODING-SYSTEM, else signal a `coding-system-error' error.
6203 It is valid if it is a symbol with a non-nil `coding-system' property.
6204 The value of property should be a vector of length 5. */)
6205 (coding_system)
6206 Lisp_Object coding_system;
6208 CHECK_SYMBOL (coding_system);
6209 if (!NILP (Fcoding_system_p (coding_system)))
6210 return coding_system;
6211 while (1)
6212 Fsignal (Qcoding_system_error, Fcons (coding_system, Qnil));
6215 Lisp_Object
6216 detect_coding_system (src, src_bytes, highest, multibytep)
6217 const unsigned char *src;
6218 int src_bytes, highest;
6219 int multibytep;
6221 int coding_mask, eol_type;
6222 Lisp_Object val, tmp;
6223 int dummy;
6225 coding_mask = detect_coding_mask (src, src_bytes, NULL, &dummy, multibytep);
6226 eol_type = detect_eol_type (src, src_bytes, &dummy);
6227 if (eol_type == CODING_EOL_INCONSISTENT)
6228 eol_type = CODING_EOL_UNDECIDED;
6230 if (!coding_mask)
6232 val = Qundecided;
6233 if (eol_type != CODING_EOL_UNDECIDED)
6235 Lisp_Object val2;
6236 val2 = Fget (Qundecided, Qeol_type);
6237 if (VECTORP (val2))
6238 val = XVECTOR (val2)->contents[eol_type];
6240 return (highest ? val : Fcons (val, Qnil));
6243 /* At first, gather possible coding systems in VAL. */
6244 val = Qnil;
6245 for (tmp = Vcoding_category_list; CONSP (tmp); tmp = XCDR (tmp))
6247 Lisp_Object category_val, category_index;
6249 category_index = Fget (XCAR (tmp), Qcoding_category_index);
6250 category_val = Fsymbol_value (XCAR (tmp));
6251 if (!NILP (category_val)
6252 && NATNUMP (category_index)
6253 && (coding_mask & (1 << XFASTINT (category_index))))
6255 val = Fcons (category_val, val);
6256 if (highest)
6257 break;
6260 if (!highest)
6261 val = Fnreverse (val);
6263 /* Then, replace the elements with subsidiary coding systems. */
6264 for (tmp = val; CONSP (tmp); tmp = XCDR (tmp))
6266 if (eol_type != CODING_EOL_UNDECIDED
6267 && eol_type != CODING_EOL_INCONSISTENT)
6269 Lisp_Object eol;
6270 eol = Fget (XCAR (tmp), Qeol_type);
6271 if (VECTORP (eol))
6272 XSETCAR (tmp, XVECTOR (eol)->contents[eol_type]);
6275 return (highest ? XCAR (val) : val);
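
The candidate-gathering loop above can be illustrated outside of Emacs. The following is only a minimal sketch: `struct category`, `gather_candidates`, and the use of plain C strings instead of Lisp symbols are assumptions made for illustration and are not part of coding.c. It shows the same idea of walking a priority-ordered list and keeping the entries whose bit is set in the detection mask.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical category record: a name plus the bit index that
   stands for the category in a detection mask.  */
struct category
{
  const char *name;
  int index;
};

/* Copy into OUT the names of the categories in PRIORITY (highest
   priority first) whose bit is set in MASK.  Stop after the first
   match when HIGHEST is nonzero.  Return the number of names stored.
   OUT must have room for N entries.  */
static size_t
gather_candidates (const struct category *priority, size_t n,
                   unsigned mask, int highest, const char **out)
{
  size_t count = 0;
  size_t i;

  assert (priority != NULL && out != NULL);
  for (i = 0; i < n; i++)
    {
      /* Skip entries whose index cannot be represented in the mask.  */
      if (priority[i].index < 0 || priority[i].index >= 32)
        continue;
      if (mask & (1u << priority[i].index))
        {
          out[count++] = priority[i].name;
          if (highest)
            break;
        }
    }
  return count;
}
```

Because the sketch appends to an array, the result is already in priority order; the Lisp version builds its list by consing at the front and therefore needs the final `Fnreverse`.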
6278 DEFUN ("detect-coding-region", Fdetect_coding_region, Sdetect_coding_region,
6279 2, 3, 0,
6280 doc: /* Detect coding system of the text in the region between START and END.
6281 Return a list of possible coding systems ordered by priority.
If only ASCII characters are found, return a list whose single element
is `undecided' or one of its subsidiary coding systems, chosen
according to the detected end-of-line format.
6287 If optional argument HIGHEST is non-nil, return the coding system of
6288 highest priority. */)
6289 (start, end, highest)
6290 Lisp_Object start, end, highest;
6292 int from, to;
6293 int from_byte, to_byte;
6294 int include_anchor_byte = 0;
6296 CHECK_NUMBER_COERCE_MARKER (start);
6297 CHECK_NUMBER_COERCE_MARKER (end);
6299 validate_region (&start, &end);
6300 from = XINT (start), to = XINT (end);
6301 from_byte = CHAR_TO_BYTE (from);
6302 to_byte = CHAR_TO_BYTE (to);
6304 if (from < GPT && to >= GPT)
6305 move_gap_both (to, to_byte);
  /* If an anchor byte `\0' follows the region, include it in the
     source to be examined, so that the code detectors can handle the
     trailing byte sequence more accurately.

     FIXME: This is not a perfect solution.  It would be better to add
     one more argument, say LAST_BLOCK, to all detect_coding_XXX.  */
6313 if (to == Z || (to == GPT && GAP_SIZE > 0))
6314 include_anchor_byte = 1;
6315 return detect_coding_system (BYTE_POS_ADDR (from_byte),
6316 to_byte - from_byte + include_anchor_byte,
6317 !NILP (highest),
6318 !NILP (current_buffer
6319 ->enable_multibyte_characters));
6322 DEFUN ("detect-coding-string", Fdetect_coding_string, Sdetect_coding_string,
6323 1, 2, 0,
6324 doc: /* Detect coding system of the text in STRING.
6325 Return a list of possible coding systems ordered by priority.
If only ASCII characters are found, return a list whose single element
is `undecided' or one of its subsidiary coding systems, chosen
according to the detected end-of-line format.
6331 If optional argument HIGHEST is non-nil, return the coding system of
6332 highest priority. */)
6333 (string, highest)
6334 Lisp_Object string, highest;
6336 CHECK_STRING (string);
6338 return detect_coding_system (SDATA (string),
                               /* "+ 1" is to include the anchor byte
                                  `\0'.  With this, code detectors can
                                  handle the trailing bytes more
                                  accurately.  */
6343 SBYTES (string) + 1,
6344 !NILP (highest),
6345 STRING_MULTIBYTE (string));
6348 /* Return an intersection of lists L1 and L2. */
6350 static Lisp_Object
6351 intersection (l1, l2)
6352 Lisp_Object l1, l2;
6354 Lisp_Object val = Fcons (Qnil, Qnil), tail;
6356 for (tail = val; CONSP (l1); l1 = XCDR (l1))
6358 if (!NILP (Fmemq (XCAR (l1), l2)))
6360 XSETCDR (tail, Fcons (XCAR (l1), Qnil));
6361 tail = XCDR (tail);
6364 return XCDR (val);
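
The dummy-head technique used by `intersection' (allocate a throwaway first cell, append at the tail, return the rest) can be shown on an ordinary singly linked list. This is a sketch under assumptions: `struct node`, `cons`, `member`, and `intersect` are hypothetical names, and integers stand in for Lisp objects compared with `eq'.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list cell holding an int instead of a Lisp object.  */
struct node
{
  int value;
  struct node *next;
};

/* Allocate a new cell in front of NEXT.  */
static struct node *
cons (int value, struct node *next)
{
  struct node *n = malloc (sizeof *n);

  assert (n != NULL);
  n->value = value;
  n->next = next;
  return n;
}

/* Return nonzero if VALUE occurs in LIST.  */
static int
member (int value, const struct node *list)
{
  for (; list != NULL; list = list->next)
    if (list->value == value)
      return 1;
  return 0;
}

/* Return a fresh list of the elements of L1 that also occur in L2,
   in the order they appear in L1.  HEAD is a dummy cell, so the loop
   never needs a special case for the first element.  */
static struct node *
intersect (const struct node *l1, const struct node *l2)
{
  struct node head = { 0, NULL };
  struct node *tail = &head;

  for (; l1 != NULL; l1 = l1->next)
    if (member (l1->value, l2))
      {
        tail->next = cons (l1->value, NULL);
        tail = tail->next;
      }
  return head.next;
}
```

The result preserves the order of the first list, which is what lets the caller keep coding systems in their original priority order. The sketch never frees its cells; that is acceptable for a demonstration but not for long-running code.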
6368 /* Subroutine for Fsafe_coding_systems_region_internal.
6370 Return a list of coding systems that safely encode the multibyte
6371 text between P and PEND. SAFE_CODINGS, if non-nil, is a list of
6372 possible coding systems. If it is nil, it means that we have not
6373 yet found any coding systems.
6375 WORK_TABLE is a copy of the char-table Vchar_coding_system_table. An
6376 element of WORK_TABLE is set to t once the element is looked up.
6378 If a non-ASCII single byte char is found, set
6379 *single_byte_char_found to 1. */
6381 static Lisp_Object
6382 find_safe_codings (p, pend, safe_codings, work_table, single_byte_char_found)
6383 unsigned char *p, *pend;
6384 Lisp_Object safe_codings, work_table;
6385 int *single_byte_char_found;
6387 int c, len, idx;
6388 Lisp_Object val;
6390 while (p < pend)
6392 c = STRING_CHAR_AND_LENGTH (p, pend - p, len);
6393 p += len;
6394 if (ASCII_BYTE_P (c))
6395 /* We can ignore ASCII characters here. */
6396 continue;
6397 if (SINGLE_BYTE_CHAR_P (c))
6398 *single_byte_char_found = 1;
6399 if (NILP (safe_codings))
6400 continue;
6401 /* Check the safe coding systems for C. */
6402 val = char_table_ref_and_index (work_table, c, &idx);
6403 if (EQ (val, Qt))
6404 /* This element was already checked. Ignore it. */
6405 continue;
6406 /* Remember that we checked this element. */
6407 CHAR_TABLE_SET (work_table, make_number (idx), Qt);
      /* If there are some safe coding systems for C and we have
         already found a set of safe coding systems for the
         characters seen so far, take the intersection of the two.  */
6412 if (!EQ (safe_codings, Qt) && !NILP (val))
6413 val = intersection (safe_codings, val);
6414 safe_codings = val;
6416 return safe_codings;
6420 /* Return a list of coding systems that safely encode the text between
6421 START and END. If the text contains only ASCII or is unibyte,
6422 return t. */
6424 DEFUN ("find-coding-systems-region-internal",
6425 Ffind_coding_systems_region_internal,
6426 Sfind_coding_systems_region_internal, 2, 2, 0,
6427 doc: /* Internal use only. */)
6428 (start, end)
6429 Lisp_Object start, end;
6431 Lisp_Object work_table, safe_codings;
6432 int non_ascii_p = 0;
6433 int single_byte_char_found = 0;
6434 const unsigned char *p1, *p1end, *p2, *p2end, *p;
6436 if (STRINGP (start))
6438 if (!STRING_MULTIBYTE (start))
6439 return Qt;
6440 p1 = SDATA (start), p1end = p1 + SBYTES (start);
6441 p2 = p2end = p1end;
6442 if (SCHARS (start) != SBYTES (start))
6443 non_ascii_p = 1;
6445 else
6447 int from, to, stop;
6449 CHECK_NUMBER_COERCE_MARKER (start);
6450 CHECK_NUMBER_COERCE_MARKER (end);
6451 if (XINT (start) < BEG || XINT (end) > Z || XINT (start) > XINT (end))
6452 args_out_of_range (start, end);
6453 if (NILP (current_buffer->enable_multibyte_characters))
6454 return Qt;
6455 from = CHAR_TO_BYTE (XINT (start));
6456 to = CHAR_TO_BYTE (XINT (end));
6457 stop = from < GPT_BYTE && GPT_BYTE < to ? GPT_BYTE : to;
6458 p1 = BYTE_POS_ADDR (from), p1end = p1 + (stop - from);
6459 if (stop == to)
6460 p2 = p2end = p1end;
6461 else
6462 p2 = BYTE_POS_ADDR (stop), p2end = p2 + (to - stop);
6463 if (XINT (end) - XINT (start) != to - from)
6464 non_ascii_p = 1;
6467 if (!non_ascii_p)
6469 /* We are sure that the text contains no multibyte character.
6470 Check if it contains eight-bit-graphic. */
6471 p = p1;
6472 for (p = p1; p < p1end && ASCII_BYTE_P (*p); p++);
6473 if (p == p1end)
6475 for (p = p2; p < p2end && ASCII_BYTE_P (*p); p++);
6476 if (p == p2end)
6477 return Qt;
6481 /* The text contains non-ASCII characters. */
6482 work_table = Fcopy_sequence (Vchar_coding_system_table);
6483 safe_codings = find_safe_codings (p1, p1end, Qt, work_table,
6484 &single_byte_char_found);
6485 if (p2 < p2end)
6486 safe_codings = find_safe_codings (p2, p2end, safe_codings, work_table,
6487 &single_byte_char_found);
6489 if (EQ (safe_codings, Qt))
6490 ; /* Nothing to be done. */
6491 else if (!single_byte_char_found)
6493 /* Append generic coding systems. */
6494 Lisp_Object args[2];
6495 args[0] = safe_codings;
6496 args[1] = Fchar_table_extra_slot (Vchar_coding_system_table,
6497 make_number (0));
6498 safe_codings = Fappend (2, args);
6500 else
6501 safe_codings = Fcons (Qraw_text,
6502 Fcons (Qemacs_mule,
6503 Fcons (Qno_conversion, safe_codings)));
6504 return safe_codings;
6508 static Lisp_Object
6509 find_safe_codings_2 (p, pend, safe_codings, work_table, single_byte_char_found)
6510 unsigned char *p, *pend;
6511 Lisp_Object safe_codings, work_table;
6512 int *single_byte_char_found;
6514 int c, len, i;
6515 Lisp_Object val, ch;
6516 Lisp_Object prev, tail;
6518 while (p < pend)
6520 c = STRING_CHAR_AND_LENGTH (p, pend - p, len);
6521 p += len;
6522 if (ASCII_BYTE_P (c))
6523 /* We can ignore ASCII characters here. */
6524 continue;
6525 if (SINGLE_BYTE_CHAR_P (c))
6526 *single_byte_char_found = 1;
6527 if (NILP (safe_codings))
6528 /* Already all coding systems are excluded. */
6529 continue;
6530 /* Check the safe coding systems for C. */
6531 ch = make_number (c);
6532 val = Faref (work_table, ch);
6533 if (EQ (val, Qt))
6534 /* This element was already checked. Ignore it. */
6535 continue;
6536 /* Remember that we checked this element. */
6537 Faset (work_table, ch, Qt);
6539 for (prev = tail = safe_codings; CONSP (tail); tail = XCDR (tail))
6541 val = XCAR (tail);
6542 if (NILP (Faref (XCDR (val), ch)))
              /* Exclude this coding system from SAFE_CODINGS.  */
6545 if (EQ (tail, safe_codings))
6546 safe_codings = XCDR (safe_codings);
6547 else
6548 XSETCDR (prev, XCDR (tail));
6550 else
6551 prev = tail;
6554 return safe_codings;
6557 DEFUN ("find-coding-systems-region-internal-2",
6558 Ffind_coding_systems_region_internal_2,
6559 Sfind_coding_systems_region_internal_2, 2, 2, 0,
6560 doc: /* Internal use only. */)
6561 (start, end)
6562 Lisp_Object start, end;
6564 Lisp_Object work_table, safe_codings;
6565 int non_ascii_p = 0;
6566 int single_byte_char_found = 0;
6567 const unsigned char *p1, *p1end, *p2, *p2end, *p;
6569 if (STRINGP (start))
6571 if (!STRING_MULTIBYTE (start))
6572 return Qt;
6573 p1 = SDATA (start), p1end = p1 + SBYTES (start);
6574 p2 = p2end = p1end;
6575 if (SCHARS (start) != SBYTES (start))
6576 non_ascii_p = 1;
6578 else
6580 int from, to, stop;
6582 CHECK_NUMBER_COERCE_MARKER (start);
6583 CHECK_NUMBER_COERCE_MARKER (end);
6584 if (XINT (start) < BEG || XINT (end) > Z || XINT (start) > XINT (end))
6585 args_out_of_range (start, end);
6586 if (NILP (current_buffer->enable_multibyte_characters))
6587 return Qt;
6588 from = CHAR_TO_BYTE (XINT (start));
6589 to = CHAR_TO_BYTE (XINT (end));
6590 stop = from < GPT_BYTE && GPT_BYTE < to ? GPT_BYTE : to;
6591 p1 = BYTE_POS_ADDR (from), p1end = p1 + (stop - from);
6592 if (stop == to)
6593 p2 = p2end = p1end;
6594 else
6595 p2 = BYTE_POS_ADDR (stop), p2end = p2 + (to - stop);
6596 if (XINT (end) - XINT (start) != to - from)
6597 non_ascii_p = 1;
6600 if (!non_ascii_p)
6602 /* We are sure that the text contains no multibyte character.
6603 Check if it contains eight-bit-graphic. */
6604 p = p1;
6605 for (p = p1; p < p1end && ASCII_BYTE_P (*p); p++);
6606 if (p == p1end)
6608 for (p = p2; p < p2end && ASCII_BYTE_P (*p); p++);
6609 if (p == p2end)
6610 return Qt;
6614 /* The text contains non-ASCII characters. */
6616 work_table = Fmake_char_table (Qchar_coding_system, Qnil);
6617 safe_codings = Fcopy_sequence (XCDR (Vcoding_system_safe_chars));
6619 safe_codings = find_safe_codings_2 (p1, p1end, safe_codings, work_table,
6620 &single_byte_char_found);
6621 if (p2 < p2end)
6622 safe_codings = find_safe_codings_2 (p2, p2end, safe_codings, work_table,
6623 &single_byte_char_found);
6624 if (EQ (safe_codings, XCDR (Vcoding_system_safe_chars)))
6625 safe_codings = Qt;
6626 else
6628 /* Turn safe_codings to a list of coding systems... */
6629 Lisp_Object val;
6631 if (single_byte_char_found)
6632 /* ... and append these for eight-bit chars. */
6633 val = Fcons (Qraw_text,
6634 Fcons (Qemacs_mule, Fcons (Qno_conversion, Qnil)));
6635 else
6636 /* ... and append generic coding systems. */
6637 val = Fcopy_sequence (XCAR (Vcoding_system_safe_chars));
6639 for (; CONSP (safe_codings); safe_codings = XCDR (safe_codings))
6640 val = Fcons (XCAR (XCAR (safe_codings)), val);
6641 safe_codings = val;
6644 return safe_codings;
/* Search from position POS for characters that are unencodable
   according to SAFE_CHARS, and return a list of their positions.  P
   points to the place in memory where the character at POS is stored.
   Stop the search at PEND or once N unencodable characters have been
   found.

   If SAFE_CHARS is a char table, an element for an unencodable
   character is nil.

   If SAFE_CHARS is nil, all non-ASCII characters are unencodable.

   Otherwise, SAFE_CHARS is t, and only eight-bit-control and
   eight-bit-graphic characters are unencodable.  */
6661 static Lisp_Object
6662 unencodable_char_position (safe_chars, pos, p, pend, n)
6663 Lisp_Object safe_chars;
6664 int pos;
6665 unsigned char *p, *pend;
6666 int n;
6668 Lisp_Object pos_list;
6670 pos_list = Qnil;
6671 while (p < pend)
6673 int len;
6674 int c = STRING_CHAR_AND_LENGTH (p, MAX_MULTIBYTE_LENGTH, len);
6676 if (c >= 128
6677 && (CHAR_TABLE_P (safe_chars)
6678 ? NILP (CHAR_TABLE_REF (safe_chars, c))
6679 : (NILP (safe_chars) || c < 256)))
6681 pos_list = Fcons (make_number (pos), pos_list);
6682 if (--n <= 0)
6683 break;
6685 pos++;
6686 p += len;
6688 return Fnreverse (pos_list);
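
The bounded scan performed by `unencodable_char_position' can be sketched on an array of already-decoded code points. Everything named here is an assumption for illustration: `find_unencodable`, the `safe_p` callback that replaces the SAFE_CHARS char table, and `latin1_safe_p` are hypothetical, and the sketch does not decode multibyte sequences as the real function does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "safe chars" predicate: treat only code points below
   0x100 as encodable.  */
static int
latin1_safe_p (unsigned c)
{
  return c < 0x100;
}

/* Scan LEN code points starting at CHARS, where CHARS[0] is at
   character position START_POS.  Store the positions of non-ASCII
   characters rejected by SAFE_P into OUT, stopping once MAX_HITS
   positions have been stored.  Return the number of positions stored.
   OUT must have room for MAX_HITS entries.  */
static size_t
find_unencodable (const unsigned *chars, size_t len, size_t start_pos,
                  int (*safe_p) (unsigned), size_t max_hits, size_t *out)
{
  size_t i, hits = 0;

  assert (chars != NULL && safe_p != NULL && out != NULL);
  for (i = 0; i < len && hits < max_hits; i++)
    if (chars[i] >= 128 && !safe_p (chars[i]))
      out[hits++] = start_pos + i;
  return hits;
}
```

One difference from the function above: with a limit of zero this sketch reports nothing, whereas the original decrements N after recording a hit and so always reports at least the first one.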
6692 DEFUN ("unencodable-char-position", Funencodable_char_position,
6693 Sunencodable_char_position, 3, 5, 0,
6694 doc: /*
Return position of first un-encodable character in a region.
START and END specify the region and CODING-SYSTEM specifies the
encoding to check.  Return nil if CODING-SYSTEM can encode the whole region.

If optional 4th argument COUNT is non-nil, it specifies the maximum
number of un-encodable characters to search for.  In this case, the
value is a list of positions.

If optional 5th argument STRING is non-nil, it is a string to search
for un-encodable characters.  In that case, START and END are indexes
into the string.  */)
6706 (start, end, coding_system, count, string)
6707 Lisp_Object start, end, coding_system, count, string;
6709 int n;
6710 Lisp_Object safe_chars;
6711 struct coding_system coding;
6712 Lisp_Object positions;
6713 int from, to;
6714 unsigned char *p, *pend;
6716 if (NILP (string))
6718 validate_region (&start, &end);
6719 from = XINT (start);
6720 to = XINT (end);
6721 if (NILP (current_buffer->enable_multibyte_characters))
6722 return Qnil;
6723 p = CHAR_POS_ADDR (from);
6724 if (to == GPT)
6725 pend = GPT_ADDR;
6726 else
6727 pend = CHAR_POS_ADDR (to);
6729 else
6731 CHECK_STRING (string);
6732 CHECK_NATNUM (start);
6733 CHECK_NATNUM (end);
6734 from = XINT (start);
6735 to = XINT (end);
6736 if (from > to
6737 || to > SCHARS (string))
6738 args_out_of_range_3 (string, start, end);
6739 if (! STRING_MULTIBYTE (string))
6740 return Qnil;
6741 p = SDATA (string) + string_char_to_byte (string, from);
6742 pend = SDATA (string) + string_char_to_byte (string, to);
6745 setup_coding_system (Fcheck_coding_system (coding_system), &coding);
6747 if (NILP (count))
6748 n = 1;
6749 else
6751 CHECK_NATNUM (count);
6752 n = XINT (count);
6755 if (coding.type == coding_type_no_conversion
6756 || coding.type == coding_type_raw_text)
6757 return Qnil;
6759 if (coding.type == coding_type_undecided)
6760 safe_chars = Qnil;
6761 else
6762 safe_chars = coding_safe_chars (coding_system);
6764 if (STRINGP (string)
6765 || from >= GPT || to <= GPT)
6766 positions = unencodable_char_position (safe_chars, from, p, pend, n);
6767 else
6769 Lisp_Object args[2];
6771 args[0] = unencodable_char_position (safe_chars, from, p, GPT_ADDR, n);
6772 n -= XINT (Flength (args[0]));
6773 if (n <= 0)
6774 positions = args[0];
6775 else
6777 args[1] = unencodable_char_position (safe_chars, GPT, GAP_END_ADDR,
6778 pend, n);
6779 positions = Fappend (2, args);
6783 return (NILP (count) ? Fcar (positions) : positions);
6787 Lisp_Object
6788 code_convert_region1 (start, end, coding_system, encodep)
6789 Lisp_Object start, end, coding_system;
6790 int encodep;
6792 struct coding_system coding;
6793 int from, to;
6795 CHECK_NUMBER_COERCE_MARKER (start);
6796 CHECK_NUMBER_COERCE_MARKER (end);
6797 CHECK_SYMBOL (coding_system);
6799 validate_region (&start, &end);
6800 from = XFASTINT (start);
6801 to = XFASTINT (end);
6803 if (NILP (coding_system))
6804 return make_number (to - from);
6806 if (setup_coding_system (Fcheck_coding_system (coding_system), &coding) < 0)
6807 error ("Invalid coding system: %s", SDATA (SYMBOL_NAME (coding_system)));
6809 coding.mode |= CODING_MODE_LAST_BLOCK;
6810 coding.src_multibyte = coding.dst_multibyte
6811 = !NILP (current_buffer->enable_multibyte_characters);
6812 code_convert_region (from, CHAR_TO_BYTE (from), to, CHAR_TO_BYTE (to),
6813 &coding, encodep, 1);
6814 Vlast_coding_system_used = coding.symbol;
6815 return make_number (coding.produced_char);
6818 DEFUN ("decode-coding-region", Fdecode_coding_region, Sdecode_coding_region,
6819 3, 3, "r\nzCoding system: ",
6820 doc: /* Decode the current region from the specified coding system.
6821 When called from a program, takes three arguments:
6822 START, END, and CODING-SYSTEM. START and END are buffer positions.
6823 This function sets `last-coding-system-used' to the precise coding system
6824 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is
6825 not fully specified.)
6826 It returns the length of the decoded text. */)
6827 (start, end, coding_system)
6828 Lisp_Object start, end, coding_system;
6830 return code_convert_region1 (start, end, coding_system, 0);
6833 DEFUN ("encode-coding-region", Fencode_coding_region, Sencode_coding_region,
6834 3, 3, "r\nzCoding system: ",
6835 doc: /* Encode the current region into the specified coding system.
6836 When called from a program, takes three arguments:
6837 START, END, and CODING-SYSTEM. START and END are buffer positions.
6838 This function sets `last-coding-system-used' to the precise coding system
6839 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is
6840 not fully specified.)
6841 It returns the length of the encoded text. */)
6842 (start, end, coding_system)
6843 Lisp_Object start, end, coding_system;
6845 return code_convert_region1 (start, end, coding_system, 1);
6848 Lisp_Object
6849 code_convert_string1 (string, coding_system, nocopy, encodep)
6850 Lisp_Object string, coding_system, nocopy;
6851 int encodep;
6853 struct coding_system coding;
6855 CHECK_STRING (string);
6856 CHECK_SYMBOL (coding_system);
6858 if (NILP (coding_system))
6859 return (NILP (nocopy) ? Fcopy_sequence (string) : string);
6861 if (setup_coding_system (Fcheck_coding_system (coding_system), &coding) < 0)
6862 error ("Invalid coding system: %s", SDATA (SYMBOL_NAME (coding_system)));
6864 coding.mode |= CODING_MODE_LAST_BLOCK;
6865 string = (encodep
6866 ? encode_coding_string (string, &coding, !NILP (nocopy))
6867 : decode_coding_string (string, &coding, !NILP (nocopy)));
6868 Vlast_coding_system_used = coding.symbol;
6870 return string;
6873 DEFUN ("decode-coding-string", Fdecode_coding_string, Sdecode_coding_string,
6874 2, 3, 0,
6875 doc: /* Decode STRING which is encoded in CODING-SYSTEM, and return the result.
6876 Optional arg NOCOPY non-nil means it is OK to return STRING itself
6877 if the decoding operation is trivial.
6878 This function sets `last-coding-system-used' to the precise coding system
6879 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is
6880 not fully specified.) */)
6881 (string, coding_system, nocopy)
6882 Lisp_Object string, coding_system, nocopy;
6884 return code_convert_string1 (string, coding_system, nocopy, 0);
6887 DEFUN ("encode-coding-string", Fencode_coding_string, Sencode_coding_string,
6888 2, 3, 0,
6889 doc: /* Encode STRING to CODING-SYSTEM, and return the result.
6890 Optional arg NOCOPY non-nil means it is OK to return STRING itself
6891 if the encoding operation is trivial.
6892 This function sets `last-coding-system-used' to the precise coding system
6893 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is
6894 not fully specified.) */)
6895 (string, coding_system, nocopy)
6896 Lisp_Object string, coding_system, nocopy;
6898 return code_convert_string1 (string, coding_system, nocopy, 1);
6901 /* Encode or decode STRING according to CODING_SYSTEM.
6902 Do not set Vlast_coding_system_used.
6904 This function is called only from macros DECODE_FILE and
6905 ENCODE_FILE, thus we ignore character composition. */
6907 Lisp_Object
6908 code_convert_string_norecord (string, coding_system, encodep)
6909 Lisp_Object string, coding_system;
6910 int encodep;
6912 struct coding_system coding;
6914 CHECK_STRING (string);
6915 CHECK_SYMBOL (coding_system);
6917 if (NILP (coding_system))
6918 return string;
6920 if (setup_coding_system (Fcheck_coding_system (coding_system), &coding) < 0)
6921 error ("Invalid coding system: %s", SDATA (SYMBOL_NAME (coding_system)));
6923 coding.composing = COMPOSITION_DISABLED;
6924 coding.mode |= CODING_MODE_LAST_BLOCK;
6925 return (encodep
6926 ? encode_coding_string (string, &coding, 1)
6927 : decode_coding_string (string, &coding, 1));
6930 DEFUN ("decode-sjis-char", Fdecode_sjis_char, Sdecode_sjis_char, 1, 1, 0,
6931 doc: /* Decode a Japanese character which has CODE in shift_jis encoding.
6932 Return the corresponding character. */)
6933 (code)
6934 Lisp_Object code;
6936 unsigned char c1, c2, s1, s2;
6937 Lisp_Object val;
6939 CHECK_NUMBER (code);
6940 s1 = (XFASTINT (code)) >> 8, s2 = (XFASTINT (code)) & 0xFF;
6941 if (s1 == 0)
6943 if (s2 < 0x80)
6944 XSETFASTINT (val, s2);
      /* Single-byte code in the JISX0201 katakana range.  */
      else if (s2 >= 0xA0 && s2 <= 0xDF)
6946 XSETFASTINT (val, MAKE_CHAR (charset_katakana_jisx0201, s2, 0));
6947 else
6948 error ("Invalid Shift JIS code: %x", XFASTINT (code));
6950 else
6952 if ((s1 < 0x80 || (s1 > 0x9F && s1 < 0xE0) || s1 > 0xEF)
6953 || (s2 < 0x40 || s2 == 0x7F || s2 > 0xFC))
6954 error ("Invalid Shift JIS code: %x", XFASTINT (code));
6955 DECODE_SJIS (s1, s2, c1, c2);
6956 XSETFASTINT (val, MAKE_CHAR (charset_jisx0208, c1, c2));
6958 return val;
6961 DEFUN ("encode-sjis-char", Fencode_sjis_char, Sencode_sjis_char, 1, 1, 0,
6962 doc: /* Encode a Japanese character CHAR to shift_jis encoding.
6963 Return the corresponding code in SJIS. */)
6964 (ch)
6965 Lisp_Object ch;
6967 int charset, c1, c2, s1, s2;
6968 Lisp_Object val;
6970 CHECK_NUMBER (ch);
6971 SPLIT_CHAR (XFASTINT (ch), charset, c1, c2);
6972 if (charset == CHARSET_ASCII)
6974 val = ch;
6976 else if (charset == charset_jisx0208
6977 && c1 > 0x20 && c1 < 0x7F && c2 > 0x20 && c2 < 0x7F)
6979 ENCODE_SJIS (c1, c2, s1, s2);
6980 XSETFASTINT (val, (s1 << 8) | s2);
6982 else if (charset == charset_katakana_jisx0201
6983 && c1 > 0x20 && c2 < 0xE0)
6985 XSETFASTINT (val, c1 | 0x80);
6987 else
6988 error ("Can't encode to shift_jis: %d", XFASTINT (ch));
6989 return val;
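
The two Shift JIS functions above rely on the DECODE_SJIS and ENCODE_SJIS macros, which are defined elsewhere. As a hedged illustration, the sketch below uses the commonly documented arithmetic for mapping between Shift JIS byte pairs and JIS X 0208 row/cell pairs; it is assumed, not confirmed by this section, that the macros compute the same thing. The names `sjis_to_jis` and `jis_to_sjis` are hypothetical, and the lead-byte check is slightly stricter than the one above (0x80 is rejected).

```c
#include <assert.h>

/* Convert a two-byte Shift JIS code (lead byte S1, trail byte S2) to
   a JIS X 0208 pair stored in *J1 and *J2.  Return 0 on success, or
   -1 if the bytes are outside the ranges accepted here.  */
static int
sjis_to_jis (unsigned s1, unsigned s2, unsigned *j1, unsigned *j2)
{
  assert (j1 != 0 && j2 != 0);
  if (s1 < 0x81 || (s1 > 0x9F && s1 < 0xE0) || s1 > 0xEF
      || s2 < 0x40 || s2 == 0x7F || s2 > 0xFC)
    return -1;
  if (s2 >= 0x9F)
    {
      /* Upper half of the trail-byte range: even JIS row.  */
      *j1 = s1 * 2 - (s1 >= 0xE0 ? 0x160 : 0xE0);
      *j2 = s2 - 0x7E;
    }
  else
    {
      /* Lower half: odd JIS row; 0x7F is skipped in the trail byte.  */
      *j1 = s1 * 2 - (s1 >= 0xE0 ? 0x161 : 0xE1);
      *j2 = s2 - (s2 >= 0x7F ? 0x20 : 0x1F);
    }
  return 0;
}

/* Convert a JIS X 0208 pair (each byte in 0x21..0x7E) to Shift JIS
   bytes stored in *S1 and *S2.  */
static void
jis_to_sjis (unsigned j1, unsigned j2, unsigned *s1, unsigned *s2)
{
  assert (s1 != 0 && s2 != 0);
  assert (j1 >= 0x21 && j1 <= 0x7E && j2 >= 0x21 && j2 <= 0x7E);
  if (j1 & 1)
    {
      *s1 = j1 / 2 + (j1 < 0x5F ? 0x71 : 0xB1);
      *s2 = j2 + (j2 >= 0x60 ? 0x20 : 0x1F);
    }
  else
    {
      *s1 = j1 / 2 + (j1 < 0x5F ? 0x70 : 0xB0);
      *s2 = j2 + 0x7E;
    }
}
```

For instance, the Shift JIS code 0x82A0 (hiragana A) corresponds to the JIS X 0208 code 0x2422, and converting that pair back yields 0x82 0xA0 again.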
6992 DEFUN ("decode-big5-char", Fdecode_big5_char, Sdecode_big5_char, 1, 1, 0,
6993 doc: /* Decode a Big5 character which has CODE in BIG5 coding system.
6994 Return the corresponding character. */)
6995 (code)
6996 Lisp_Object code;
6998 int charset;
6999 unsigned char b1, b2, c1, c2;
7000 Lisp_Object val;
7002 CHECK_NUMBER (code);
7003 b1 = (XFASTINT (code)) >> 8, b2 = (XFASTINT (code)) & 0xFF;
7004 if (b1 == 0)
7006 if (b2 >= 0x80)
7007 error ("Invalid BIG5 code: %x", XFASTINT (code));
7008 val = code;
7010 else
7012 if ((b1 < 0xA1 || b1 > 0xFE)
7013 || (b2 < 0x40 || (b2 > 0x7E && b2 < 0xA1) || b2 > 0xFE))
7014 error ("Invalid BIG5 code: %x", XFASTINT (code));
7015 DECODE_BIG5 (b1, b2, charset, c1, c2);
7016 XSETFASTINT (val, MAKE_CHAR (charset, c1, c2));
7018 return val;
7021 DEFUN ("encode-big5-char", Fencode_big5_char, Sencode_big5_char, 1, 1, 0,
7022 doc: /* Encode the Big5 character CHAR to BIG5 coding system.
7023 Return the corresponding character code in Big5. */)
7024 (ch)
7025 Lisp_Object ch;
7027 int charset, c1, c2, b1, b2;
7028 Lisp_Object val;
7030 CHECK_NUMBER (ch);
7031 SPLIT_CHAR (XFASTINT (ch), charset, c1, c2);
7032 if (charset == CHARSET_ASCII)
7034 val = ch;
7036 else if ((charset == charset_big5_1
7037 && (XFASTINT (ch) >= 0x250a1 && XFASTINT (ch) <= 0x271ec))
7038 || (charset == charset_big5_2
7039 && XFASTINT (ch) >= 0x290a1 && XFASTINT (ch) <= 0x2bdb2))
7041 ENCODE_BIG5 (charset, c1, c2, b1, b2);
7042 XSETFASTINT (val, (b1 << 8) | b2);
7044 else
7045 error ("Can't encode to Big5: %d", XFASTINT (ch));
7046 return val;
7049 DEFUN ("set-terminal-coding-system-internal", Fset_terminal_coding_system_internal,
7050 Sset_terminal_coding_system_internal, 1, 1, 0,
7051 doc: /* Internal use only. */)
7052 (coding_system)
7053 Lisp_Object coding_system;
7055 CHECK_SYMBOL (coding_system);
7056 setup_coding_system (Fcheck_coding_system (coding_system), &terminal_coding);
7057 /* We had better not send unsafe characters to terminal. */
7058 terminal_coding.flags |= CODING_FLAG_ISO_SAFE;
7059 /* Character composition should be disabled. */
7060 terminal_coding.composing = COMPOSITION_DISABLED;
7061 /* Error notification should be suppressed. */
7062 terminal_coding.suppress_error = 1;
7063 terminal_coding.src_multibyte = 1;
7064 terminal_coding.dst_multibyte = 0;
7065 return Qnil;
7068 DEFUN ("set-safe-terminal-coding-system-internal", Fset_safe_terminal_coding_system_internal,
7069 Sset_safe_terminal_coding_system_internal, 1, 1, 0,
7070 doc: /* Internal use only. */)
7071 (coding_system)
7072 Lisp_Object coding_system;
7074 CHECK_SYMBOL (coding_system);
7075 setup_coding_system (Fcheck_coding_system (coding_system),
7076 &safe_terminal_coding);
7077 /* Character composition should be disabled. */
7078 safe_terminal_coding.composing = COMPOSITION_DISABLED;
7079 /* Error notification should be suppressed. */
  safe_terminal_coding.suppress_error = 1;
7081 safe_terminal_coding.src_multibyte = 1;
7082 safe_terminal_coding.dst_multibyte = 0;
7083 return Qnil;
7086 DEFUN ("terminal-coding-system", Fterminal_coding_system,
7087 Sterminal_coding_system, 0, 0, 0,
7088 doc: /* Return coding system specified for terminal output. */)
7091 return terminal_coding.symbol;
7094 DEFUN ("set-keyboard-coding-system-internal", Fset_keyboard_coding_system_internal,
7095 Sset_keyboard_coding_system_internal, 1, 1, 0,
7096 doc: /* Internal use only. */)
7097 (coding_system)
7098 Lisp_Object coding_system;
7100 CHECK_SYMBOL (coding_system);
7101 setup_coding_system (Fcheck_coding_system (coding_system), &keyboard_coding);
7102 /* Character composition should be disabled. */
7103 keyboard_coding.composing = COMPOSITION_DISABLED;
7104 return Qnil;
7107 DEFUN ("keyboard-coding-system", Fkeyboard_coding_system,
7108 Skeyboard_coding_system, 0, 0, 0,
7109 doc: /* Return coding system specified for decoding keyboard input. */)
7112 return keyboard_coding.symbol;
7116 DEFUN ("find-operation-coding-system", Ffind_operation_coding_system,
7117 Sfind_operation_coding_system, 1, MANY, 0,
7118 doc: /* Choose a coding system for an operation based on the target name.
7119 The value names a pair of coding systems: (DECODING-SYSTEM . ENCODING-SYSTEM).
7120 DECODING-SYSTEM is the coding system to use for decoding
7121 \(in case OPERATION does decoding), and ENCODING-SYSTEM is the coding system
7122 for encoding (in case OPERATION does encoding).
7124 The first argument OPERATION specifies an I/O primitive:
7125 For file I/O, `insert-file-contents' or `write-region'.
7126 For process I/O, `call-process', `call-process-region', or `start-process'.
7127 For network I/O, `open-network-stream'.
7129 The remaining arguments should be the same arguments that were passed
7130 to the primitive. Depending on which primitive, one of those arguments
7131 is selected as the TARGET. For example, if OPERATION does file I/O,
7132 whichever argument specifies the file name is TARGET.
7134 TARGET has a meaning which depends on OPERATION:
7135 For file I/O, TARGET is a file name.
7136 For process I/O, TARGET is a process name.
For network I/O, TARGET is a service name or a port number.

This function looks up what is specified for TARGET in
`file-coding-system-alist', `process-coding-system-alist',
or `network-coding-system-alist', depending on OPERATION.
7142 They may specify a coding system, a cons of coding systems,
7143 or a function symbol to call.
7144 In the last case, we call the function with one argument,
7145 which is a list of all the arguments given to this function.
7147 usage: (find-operation-coding-system OPERATION ARGUMENTS ...) */)
7148 (nargs, args)
7149 int nargs;
7150 Lisp_Object *args;
7152 Lisp_Object operation, target_idx, target, val;
7153 register Lisp_Object chain;
7155 if (nargs < 2)
7156 error ("Too few arguments");
7157 operation = args[0];
7158 if (!SYMBOLP (operation)
7159 || !INTEGERP (target_idx = Fget (operation, Qtarget_idx)))
7160 error ("Invalid first argument");
7161 if (nargs < 1 + XINT (target_idx))
7162 error ("Too few arguments for operation: %s",
7163 SDATA (SYMBOL_NAME (operation)));
  /* For write-region, if the 6th argument (i.e. VISIT, the 5th
     argument to write-region) is a string, it must be treated as the
     target file name.  */
7167 if (EQ (operation, Qwrite_region)
7168 && nargs > 5
7169 && STRINGP (args[5]))
7170 target_idx = make_number (4);
7171 target = args[XINT (target_idx) + 1];
7172 if (!(STRINGP (target)
7173 || (EQ (operation, Qopen_network_stream) && INTEGERP (target))))
7174 error ("Invalid argument %d", XINT (target_idx) + 1);
7176 chain = ((EQ (operation, Qinsert_file_contents)
7177 || EQ (operation, Qwrite_region))
7178 ? Vfile_coding_system_alist
7179 : (EQ (operation, Qopen_network_stream)
7180 ? Vnetwork_coding_system_alist
7181 : Vprocess_coding_system_alist));
7182 if (NILP (chain))
7183 return Qnil;
7185 for (; CONSP (chain); chain = XCDR (chain))
7187 Lisp_Object elt;
7188 elt = XCAR (chain);
7190 if (CONSP (elt)
7191 && ((STRINGP (target)
7192 && STRINGP (XCAR (elt))
7193 && fast_string_match (XCAR (elt), target) >= 0)
7194 || (INTEGERP (target) && EQ (target, XCAR (elt)))))
7196 val = XCDR (elt);
7197 /* Here, if VAL is both a valid coding system and a valid
7198 function symbol, we return VAL as a coding system. */
7199 if (CONSP (val))
7200 return val;
7201 if (! SYMBOLP (val))
7202 return Qnil;
7203 if (! NILP (Fcoding_system_p (val)))
7204 return Fcons (val, val);
7205 if (! NILP (Ffboundp (val)))
7207 val = call1 (val, Flist (nargs, args));
7208 if (CONSP (val))
7209 return val;
7210 if (SYMBOLP (val) && ! NILP (Fcoding_system_p (val)))
7211 return Fcons (val, val);
7213 return Qnil;
7216 return Qnil;
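
The alist walk above (first entry whose pattern matches the target wins) can be sketched with POSIX regular expressions. This is an illustration under assumptions: `struct coding_rule` and `lookup_rule` are hypothetical, POSIX extended syntax differs from Emacs regexp syntax, and the sketch covers only string targets, not port numbers or function-valued entries.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical alist entry: a pattern and the pair of coding system
   names to use when the pattern matches the target.  */
struct coding_rule
{
  const char *pattern;
  const char *decoding;
  const char *encoding;
};

/* Return the first of the N entries in RULES whose pattern matches
   TARGET, or NULL if none does.  Entries whose pattern does not
   compile are skipped.  */
static const struct coding_rule *
lookup_rule (const struct coding_rule *rules, size_t n, const char *target)
{
  size_t i;

  assert (rules != NULL && target != NULL);
  for (i = 0; i < n; i++)
    {
      regex_t re;
      int matched;

      if (regcomp (&re, rules[i].pattern, REG_EXTENDED | REG_NOSUB) != 0)
        continue;
      matched = regexec (&re, target, 0, NULL, 0) == 0;
      regfree (&re);
      if (matched)
        return &rules[i];
    }
  return NULL;
}
```

As in the function above, order matters: an earlier, more general pattern shadows any later entry that would also match.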
7219 DEFUN ("update-coding-systems-internal", Fupdate_coding_systems_internal,
7220 Supdate_coding_systems_internal, 0, 0, 0,
7221 doc: /* Update internal database for ISO2022 and CCL based coding systems.
7222 When values of any coding categories are changed, you must
7223 call this function. */)
7226 int i;
7228 for (i = CODING_CATEGORY_IDX_EMACS_MULE; i < CODING_CATEGORY_IDX_MAX; i++)
7230 Lisp_Object val;
7232 val = SYMBOL_VALUE (XVECTOR (Vcoding_category_table)->contents[i]);
7233 if (!NILP (val))
7235 if (! coding_system_table[i])
7236 coding_system_table[i] = ((struct coding_system *)
7237 xmalloc (sizeof (struct coding_system)));
7238 setup_coding_system (val, coding_system_table[i]);
7240 else if (coding_system_table[i])
7242 xfree (coding_system_table[i]);
7243 coding_system_table[i] = NULL;
7247 return Qnil;
7250 DEFUN ("set-coding-priority-internal", Fset_coding_priority_internal,
7251 Sset_coding_priority_internal, 0, 0, 0,
7252 doc: /* Update internal database for the current value of `coding-category-list'.
This function is for internal use only.  */)
7256 int i = 0, idx;
7257 Lisp_Object val;
7259 val = Vcoding_category_list;
7261 while (CONSP (val) && i < CODING_CATEGORY_IDX_MAX)
7263 if (! SYMBOLP (XCAR (val)))
7264 break;
7265 idx = XFASTINT (Fget (XCAR (val), Qcoding_category_index));
7266 if (idx >= CODING_CATEGORY_IDX_MAX)
7267 break;
7268 coding_priorities[i++] = (1 << idx);
7269 val = XCDR (val);
7271 /* If coding-category-list is valid and contains all coding
7272 categories, `i' should be CODING_CATEGORY_IDX_MAX now. If not,
7273 the following code saves Emacs from crashing. */
7274 while (i < CODING_CATEGORY_IDX_MAX)
7275 coding_priorities[i++] = CODING_CATEGORY_MASK_RAW_TEXT;
7277 return Qnil;
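
The defensive fill at the end of the function above (pad the priority table with a fallback mask when the list is short or malformed) can be shown in isolation. The constants `CATEGORY_MAX` and `RAW_TEXT_MASK` and the function `set_priorities` are hypothetical stand-ins chosen for this sketch; the real values come from the coding category definitions elsewhere in the file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CODING_CATEGORY_IDX_MAX and
   CODING_CATEGORY_MASK_RAW_TEXT.  */
#define CATEGORY_MAX 8
#define RAW_TEXT_MASK (1u << 7)

/* Fill PRIORITIES (CATEGORY_MAX entries) from the N category indexes
   in INDEXES, highest priority first.  Stop at the first invalid
   index, then pad the rest of the table with RAW_TEXT_MASK so that
   every slot holds a usable mask.  */
static void
set_priorities (const int *indexes, size_t n, unsigned *priorities)
{
  size_t i = 0;

  assert (indexes != NULL && priorities != NULL);
  while (i < n && i < CATEGORY_MAX)
    {
      if (indexes[i] < 0 || indexes[i] >= CATEGORY_MAX)
        break;
      priorities[i] = 1u << indexes[i];
      i++;
    }
  while (i < CATEGORY_MAX)
    priorities[i++] = RAW_TEXT_MASK;
}
```

Padding instead of signalling an error keeps later detection code from reading uninitialized slots, at the cost of silently ignoring everything after the first bad entry.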
7280 DEFUN ("define-coding-system-internal", Fdefine_coding_system_internal,
7281 Sdefine_coding_system_internal, 1, 1, 0,
7282 doc: /* Register CODING-SYSTEM as a base coding system.
This function is for internal use only.  */)
7284 (coding_system)
7285 Lisp_Object coding_system;
7287 Lisp_Object safe_chars, slot;
7289 if (NILP (Fcheck_coding_system (coding_system)))
7290 Fsignal (Qcoding_system_error, Fcons (coding_system, Qnil));
7291 safe_chars = coding_safe_chars (coding_system);
7292 if (! EQ (safe_chars, Qt) && ! CHAR_TABLE_P (safe_chars))
7293 error ("No valid safe-chars property for %s",
7294 SDATA (SYMBOL_NAME (coding_system)));
7295 if (EQ (safe_chars, Qt))
7297 if (NILP (Fmemq (coding_system, XCAR (Vcoding_system_safe_chars))))
7298 XSETCAR (Vcoding_system_safe_chars,
7299 Fcons (coding_system, XCAR (Vcoding_system_safe_chars)));
7301 else
7303 slot = Fassq (coding_system, XCDR (Vcoding_system_safe_chars));
7304 if (NILP (slot))
7305 XSETCDR (Vcoding_system_safe_chars,
7306 nconc2 (XCDR (Vcoding_system_safe_chars),
7307 Fcons (Fcons (coding_system, safe_chars), Qnil)));
7308 else
7309 XSETCDR (slot, safe_chars);
7311 return Qnil;
7314 #endif /* emacs */
7317 /*** 9. Post-amble ***/
7319 void
7320 init_coding_once ()
7322 int i;
7324 /* Initialization specific to Emacs' internal format. */
7325 for (i = 0; i <= 0x20; i++)
7326 emacs_code_class[i] = EMACS_control_code;
7327 emacs_code_class[0x0A] = EMACS_linefeed_code;
7328 emacs_code_class[0x0D] = EMACS_carriage_return_code;
7329 for (i = 0x21 ; i < 0x7F; i++)
7330 emacs_code_class[i] = EMACS_ascii_code;
7331 emacs_code_class[0x7F] = EMACS_control_code;
7332 for (i = 0x80; i < 0xFF; i++)
7333 emacs_code_class[i] = EMACS_invalid_code;
7334 emacs_code_class[LEADING_CODE_PRIVATE_11] = EMACS_leading_code_3;
7335 emacs_code_class[LEADING_CODE_PRIVATE_12] = EMACS_leading_code_3;
7336 emacs_code_class[LEADING_CODE_PRIVATE_21] = EMACS_leading_code_4;
7337 emacs_code_class[LEADING_CODE_PRIVATE_22] = EMACS_leading_code_4;
7339 /* Initialization specific to ISO2022. */
7340 for (i = 0; i < 0x20; i++)
7341 iso_code_class[i] = ISO_control_0;
7342 for (i = 0x21; i < 0x7F; i++)
7343 iso_code_class[i] = ISO_graphic_plane_0;
7344 for (i = 0x80; i < 0xA0; i++)
7345 iso_code_class[i] = ISO_control_1;
7346 for (i = 0xA1; i < 0xFF; i++)
7347 iso_code_class[i] = ISO_graphic_plane_1;
7348 iso_code_class[0x20] = iso_code_class[0x7F] = ISO_0x20_or_0x7F;
7349 iso_code_class[0xA0] = iso_code_class[0xFF] = ISO_0xA0_or_0xFF;
7350 iso_code_class[ISO_CODE_CR] = ISO_carriage_return;
7351 iso_code_class[ISO_CODE_SO] = ISO_shift_out;
7352 iso_code_class[ISO_CODE_SI] = ISO_shift_in;
7353 iso_code_class[ISO_CODE_SS2_7] = ISO_single_shift_2_7;
7354 iso_code_class[ISO_CODE_ESC] = ISO_escape;
7355 iso_code_class[ISO_CODE_SS2] = ISO_single_shift_2;
7356 iso_code_class[ISO_CODE_SS3] = ISO_single_shift_3;
7357 iso_code_class[ISO_CODE_CSI] = ISO_control_sequence_introducer;
7359 setup_coding_system (Qnil, &keyboard_coding);
7360 setup_coding_system (Qnil, &terminal_coding);
7361 setup_coding_system (Qnil, &safe_terminal_coding);
7362 setup_coding_system (Qnil, &default_buffer_file_coding);
7364 bzero (coding_system_table, sizeof coding_system_table);
7366 bzero (ascii_skip_code, sizeof ascii_skip_code);
7367 for (i = 0; i < 128; i++)
7368 ascii_skip_code[i] = 1;
7370 #if defined (MSDOS) || defined (WINDOWSNT)
7371 system_eol_type = CODING_EOL_CRLF;
7372 #else
7373 system_eol_type = CODING_EOL_LF;
7374 #endif
7376 inhibit_pre_post_conversion = 0;
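The initialization above builds 256-entry lookup tables that classify every possible byte value once, so that the decoders can dispatch on a byte with a single array access. A reduced sketch of the same technique is shown here. The enum, `byte_class_table`, and `init_byte_class_table` are hypothetical, and the classes are a simplification of the `EMACS_*` classes used in coding.c (all bytes from 0x80 up are lumped into one class).

```c
/* Simplified byte classes; coding.c distinguishes more cases.  */
enum byte_class
{
  BC_CONTROL,
  BC_LINEFEED,
  BC_CARRIAGE_RETURN,
  BC_ASCII,
  BC_HIGH
};

static enum byte_class byte_class_table[256];

/* Fill the table range by range, then override the special bytes.  */
static void
init_byte_class_table (void)
{
  int i;

  for (i = 0; i <= 0x20; i++)          /* controls and space */
    byte_class_table[i] = BC_CONTROL;
  byte_class_table[0x0A] = BC_LINEFEED;
  byte_class_table[0x0D] = BC_CARRIAGE_RETURN;
  for (i = 0x21; i < 0x7F; i++)        /* printable ASCII */
    byte_class_table[i] = BC_ASCII;
  byte_class_table[0x7F] = BC_CONTROL; /* DEL */
  for (i = 0x80; i <= 0xFF; i++)       /* everything above ASCII */
    byte_class_table[i] = BC_HIGH;
}
```

After `init_byte_class_table ()` has run, classifying a byte `c` is just `byte_class_table[(unsigned char) c]`.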
7379 #ifdef emacs
7381 void
7382 syms_of_coding ()
7384 Qtarget_idx = intern ("target-idx");
7385 staticpro (&Qtarget_idx);
7387 Qcoding_system_history = intern ("coding-system-history");
7388 staticpro (&Qcoding_system_history);
7389 Fset (Qcoding_system_history, Qnil);
7391 /* Target FILENAME is the first argument. */
7392 Fput (Qinsert_file_contents, Qtarget_idx, make_number (0));
7393 /* Target FILENAME is the third argument. */
7394 Fput (Qwrite_region, Qtarget_idx, make_number (2));
7396 Qcall_process = intern ("call-process");
7397 staticpro (&Qcall_process);
7398 /* Target PROGRAM is the first argument. */
7399 Fput (Qcall_process, Qtarget_idx, make_number (0));
7401 Qcall_process_region = intern ("call-process-region");
7402 staticpro (&Qcall_process_region);
7403 /* Target PROGRAM is the third argument. */
7404 Fput (Qcall_process_region, Qtarget_idx, make_number (2));
7406 Qstart_process = intern ("start-process");
7407 staticpro (&Qstart_process);
7408 /* Target PROGRAM is the third argument. */
7409 Fput (Qstart_process, Qtarget_idx, make_number (2));
7411 Qopen_network_stream = intern ("open-network-stream");
7412 staticpro (&Qopen_network_stream);
7413 /* Target SERVICE is the fourth argument. */
7414 Fput (Qopen_network_stream, Qtarget_idx, make_number (3));
7416 Qcoding_system = intern ("coding-system");
7417 staticpro (&Qcoding_system);
7419 Qeol_type = intern ("eol-type");
7420 staticpro (&Qeol_type);
7422 Qbuffer_file_coding_system = intern ("buffer-file-coding-system");
7423 staticpro (&Qbuffer_file_coding_system);
7425 Qpost_read_conversion = intern ("post-read-conversion");
7426 staticpro (&Qpost_read_conversion);
7428 Qpre_write_conversion = intern ("pre-write-conversion");
7429 staticpro (&Qpre_write_conversion);
7431 Qno_conversion = intern ("no-conversion");
7432 staticpro (&Qno_conversion);
7434 Qundecided = intern ("undecided");
7435 staticpro (&Qundecided);
7437 Qcoding_system_p = intern ("coding-system-p");
7438 staticpro (&Qcoding_system_p);
7440 Qcoding_system_error = intern ("coding-system-error");
7441 staticpro (&Qcoding_system_error);
7443 Fput (Qcoding_system_error, Qerror_conditions,
7444 Fcons (Qcoding_system_error, Fcons (Qerror, Qnil)));
7445 Fput (Qcoding_system_error, Qerror_message,
7446 build_string ("Invalid coding system"));
7448 Qcoding_category = intern ("coding-category");
7449 staticpro (&Qcoding_category);
7450 Qcoding_category_index = intern ("coding-category-index");
7451 staticpro (&Qcoding_category_index);
7453 Vcoding_category_table
7454 = Fmake_vector (make_number (CODING_CATEGORY_IDX_MAX), Qnil);
7455 staticpro (&Vcoding_category_table);
7457 int i;
7458 for (i = 0; i < CODING_CATEGORY_IDX_MAX; i++)
7460 XVECTOR (Vcoding_category_table)->contents[i]
7461 = intern (coding_category_name[i]);
7462 Fput (XVECTOR (Vcoding_category_table)->contents[i],
7463 Qcoding_category_index, make_number (i));
7467 Vcoding_system_safe_chars = Fcons (Qnil, Qnil);
7468 staticpro (&Vcoding_system_safe_chars);
7470 Qtranslation_table = intern ("translation-table");
7471 staticpro (&Qtranslation_table);
7472 Fput (Qtranslation_table, Qchar_table_extra_slots, make_number (1));
7474 Qtranslation_table_id = intern ("translation-table-id");
7475 staticpro (&Qtranslation_table_id);
7477 Qtranslation_table_for_decode = intern ("translation-table-for-decode");
7478 staticpro (&Qtranslation_table_for_decode);
7480 Qtranslation_table_for_encode = intern ("translation-table-for-encode");
7481 staticpro (&Qtranslation_table_for_encode);
7483 Qsafe_chars = intern ("safe-chars");
7484 staticpro (&Qsafe_chars);
7486 Qchar_coding_system = intern ("char-coding-system");
7487 staticpro (&Qchar_coding_system);
7489 /* Intern this now in case it isn't already done.
7490 Setting this variable twice is harmless.
7491 But don't staticpro it here--that is done in alloc.c. */
7492 Qchar_table_extra_slots = intern ("char-table-extra-slots");
7493 Fput (Qsafe_chars, Qchar_table_extra_slots, make_number (0));
7494 Fput (Qchar_coding_system, Qchar_table_extra_slots, make_number (2));
7496 Qvalid_codes = intern ("valid-codes");
7497 staticpro (&Qvalid_codes);
7499 Qemacs_mule = intern ("emacs-mule");
7500 staticpro (&Qemacs_mule);
7502 Qraw_text = intern ("raw-text");
7503 staticpro (&Qraw_text);
7505 defsubr (&Scoding_system_p);
7506 defsubr (&Sread_coding_system);
7507 defsubr (&Sread_non_nil_coding_system);
7508 defsubr (&Scheck_coding_system);
7509 defsubr (&Sdetect_coding_region);
7510 defsubr (&Sdetect_coding_string);
7511 defsubr (&Sfind_coding_systems_region_internal);
7512 defsubr (&Sfind_coding_systems_region_internal_2);
7513 defsubr (&Sunencodable_char_position);
7514 defsubr (&Sdecode_coding_region);
7515 defsubr (&Sencode_coding_region);
7516 defsubr (&Sdecode_coding_string);
7517 defsubr (&Sencode_coding_string);
7518 defsubr (&Sdecode_sjis_char);
7519 defsubr (&Sencode_sjis_char);
7520 defsubr (&Sdecode_big5_char);
7521 defsubr (&Sencode_big5_char);
7522 defsubr (&Sset_terminal_coding_system_internal);
7523 defsubr (&Sset_safe_terminal_coding_system_internal);
7524 defsubr (&Sterminal_coding_system);
7525 defsubr (&Sset_keyboard_coding_system_internal);
7526 defsubr (&Skeyboard_coding_system);
7527 defsubr (&Sfind_operation_coding_system);
7528 defsubr (&Supdate_coding_systems_internal);
7529 defsubr (&Sset_coding_priority_internal);
7530 defsubr (&Sdefine_coding_system_internal);
7532 DEFVAR_LISP ("coding-system-list", &Vcoding_system_list,
7533 doc: /* List of coding systems.
7535 Do not alter the value of this variable manually. This variable should be
7536 updated by the functions `make-coding-system' and
7537 `define-coding-system-alias'. */);
7538 Vcoding_system_list = Qnil;
7540 DEFVAR_LISP ("coding-system-alist", &Vcoding_system_alist,
7541 doc: /* Alist of coding system names.
7542 Each element is a one-element list containing a coding system name.
7543 This variable is given to `completing-read' as TABLE argument.
7545 Do not alter the value of this variable manually. This variable should be
7546 updated by the functions `make-coding-system' and
7547 `define-coding-system-alias'. */);
7548 Vcoding_system_alist = Qnil;
7550 DEFVAR_LISP ("coding-category-list", &Vcoding_category_list,
7551 doc: /* List of coding-categories (symbols) ordered by priority.
7553 When detecting a coding system, Emacs tries the code detection algorithms
7554 associated with each coding-category one by one in this order. When
7555 one algorithm agrees with a byte sequence of source text, the coding
7556 system bound to the corresponding coding-category is selected. */);
7558 int i;
7560 Vcoding_category_list = Qnil;
7561 for (i = CODING_CATEGORY_IDX_MAX - 1; i >= 0; i--)
7562 Vcoding_category_list
7563 = Fcons (XVECTOR (Vcoding_category_table)->contents[i],
7564 Vcoding_category_list);
7567 DEFVAR_LISP ("coding-system-for-read", &Vcoding_system_for_read,
7568 doc: /* Specify the coding system for read operations.
7569 It is useful to bind this variable with `let', but do not set it globally.
7570 If the value is a coding system, it is used for decoding on read operation.
7571 If not, an appropriate element is used from one of the coding system alists.
7572 There are three such alists: `file-coding-system-alist',
7573 `process-coding-system-alist', and `network-coding-system-alist'. */);
7574 Vcoding_system_for_read = Qnil;
7576 DEFVAR_LISP ("coding-system-for-write", &Vcoding_system_for_write,
7577 doc: /* Specify the coding system for write operations.
7578 Programs bind this variable with `let', but you should not set it globally.
7579 If the value is a coding system, it is used for encoding of output,
7580 when writing it to a file and when sending it to a file or subprocess.
7582 If this does not specify a coding system, an appropriate element
7583 is used from one of the coding system alists.
7584 There are three such alists: `file-coding-system-alist',
7585 `process-coding-system-alist', and `network-coding-system-alist'.
7586 For output to files, if the above procedure does not specify a coding system,
7587 the value of `buffer-file-coding-system' is used. */);
7588 Vcoding_system_for_write = Qnil;
7590 DEFVAR_LISP ("last-coding-system-used", &Vlast_coding_system_used,
7591 doc: /* Coding system used in the latest file or process I/O. */);
7592 Vlast_coding_system_used = Qnil;
7594 DEFVAR_BOOL ("inhibit-eol-conversion", &inhibit_eol_conversion,
7595 doc: /* *Non-nil means always inhibit code conversion of end-of-line format.
7596 See info node `Coding Systems' and info node `Text and Binary' concerning
7597 such conversion. */);
7598 inhibit_eol_conversion = 0;
7600 DEFVAR_BOOL ("inherit-process-coding-system", &inherit_process_coding_system,
7601 doc: /* Non-nil means process buffer inherits coding system of process output.
7602 Bind it to t if the process output is to be treated as if it were a file
7603 read from some filesystem. */);
7604 inherit_process_coding_system = 0;
7606 DEFVAR_LISP ("file-coding-system-alist", &Vfile_coding_system_alist,
7607 doc: /* Alist to decide a coding system to use for a file I/O operation.
7608 The format is ((PATTERN . VAL) ...),
7609 where PATTERN is a regular expression matching a file name,
7610 VAL is a coding system, a cons of coding systems, or a function symbol.
7611 If VAL is a coding system, it is used for both decoding and encoding
7612 the file contents.
7613 If VAL is a cons of coding systems, the car part is used for decoding,
7614 and the cdr part is used for encoding.
7615 If VAL is a function symbol, the function must return a coding system
7616 or a cons of coding systems which are used as above. The function gets
7617 the arguments with which `find-operation-coding-system' was called.
7619 See also the function `find-operation-coding-system'
7620 and the variable `auto-coding-alist'. */);
7621 Vfile_coding_system_alist = Qnil;
7623 DEFVAR_LISP ("process-coding-system-alist", &Vprocess_coding_system_alist,
7624 doc: /* Alist to decide a coding system to use for a process I/O operation.
7625 The format is ((PATTERN . VAL) ...),
7626 where PATTERN is a regular expression matching a program name,
7627 VAL is a coding system, a cons of coding systems, or a function symbol.
7628 If VAL is a coding system, it is used both for decoding what is received
7629 from the program and for encoding what is sent to the program.
7630 If VAL is a cons of coding systems, the car part is used for decoding,
7631 and the cdr part is used for encoding.
7632 If VAL is a function symbol, the function must return a coding system
7633 or a cons of coding systems which are used as above.
7635 See also the function `find-operation-coding-system'. */);
7636 Vprocess_coding_system_alist = Qnil;
7638 DEFVAR_LISP ("network-coding-system-alist", &Vnetwork_coding_system_alist,
7639 doc: /* Alist to decide a coding system to use for a network I/O operation.
7640 The format is ((PATTERN . VAL) ...),
7641 where PATTERN is a regular expression matching a network service name
7642 or is a port number to connect to,
7643 VAL is a coding system, a cons of coding systems, or a function symbol.
7644 If VAL is a coding system, it is used both for decoding what is received
7645 from the network stream and for encoding what is sent to the network stream.
7646 If VAL is a cons of coding systems, the car part is used for decoding,
7647 and the cdr part is used for encoding.
7648 If VAL is a function symbol, the function must return a coding system
7649 or a cons of coding systems which are used as above.
7651 See also the function `find-operation-coding-system'. */);
7652 Vnetwork_coding_system_alist = Qnil;
7654 DEFVAR_LISP ("locale-coding-system", &Vlocale_coding_system,
7655 doc: /* Coding system to use with system messages.
7656 Also used for decoding keyboard input on the X Window System. */);
7657 Vlocale_coding_system = Qnil;
7659 /* The eol mnemonics are reset in startup.el system-dependently. */
7660 DEFVAR_LISP ("eol-mnemonic-unix", &eol_mnemonic_unix,
7661 doc: /* *String displayed in mode line for UNIX-like (LF) end-of-line format. */);
7662 eol_mnemonic_unix = build_string (":");
7664 DEFVAR_LISP ("eol-mnemonic-dos", &eol_mnemonic_dos,
7665 doc: /* *String displayed in mode line for DOS-like (CRLF) end-of-line format. */);
7666 eol_mnemonic_dos = build_string ("\\");
7668 DEFVAR_LISP ("eol-mnemonic-mac", &eol_mnemonic_mac,
7669 doc: /* *String displayed in mode line for MAC-like (CR) end-of-line format. */);
7670 eol_mnemonic_mac = build_string ("/");
7672 DEFVAR_LISP ("eol-mnemonic-undecided", &eol_mnemonic_undecided,
7673 doc: /* *String displayed in mode line when end-of-line format is not yet determined. */);
7674 eol_mnemonic_undecided = build_string (":");
7676 DEFVAR_LISP ("enable-character-translation", &Venable_character_translation,
7677 doc: /* *Non-nil enables character translation while encoding and decoding. */);
7678 Venable_character_translation = Qt;
7680 DEFVAR_LISP ("standard-translation-table-for-decode",
7681 &Vstandard_translation_table_for_decode,
7682 doc: /* Table for translating characters while decoding. */);
7683 Vstandard_translation_table_for_decode = Qnil;
7685 DEFVAR_LISP ("standard-translation-table-for-encode",
7686 &Vstandard_translation_table_for_encode,
7687 doc: /* Table for translating characters while encoding. */);
7688 Vstandard_translation_table_for_encode = Qnil;
7690 DEFVAR_LISP ("charset-revision-table", &Vcharset_revision_alist,
7691 doc: /* Alist of charsets vs revision numbers.
7692 While encoding, if a charset (car part of an element) is found,
7693 designate it with the escape sequence identifying revision (cdr part of the element). */);
7694 Vcharset_revision_alist = Qnil;
7696 DEFVAR_LISP ("default-process-coding-system",
7697 &Vdefault_process_coding_system,
7698 doc: /* Cons of coding systems used for process I/O by default.
7699 The car part is used for decoding process output,
7700 the cdr part is used for encoding text to be sent to a process. */);
7701 Vdefault_process_coding_system = Qnil;
7703 DEFVAR_LISP ("latin-extra-code-table", &Vlatin_extra_code_table,
7704 doc: /* Table of extra Latin codes in the range 128..159 (inclusive).
7705 This is a vector of length 256.
7706 If the Nth element is non-nil, the existence of code N in a file
7707 \(or output of a subprocess) doesn't prevent it from being detected as
7708 a coding system of an ISO 2022 variant which has the flag
7709 `accept-latin-extra-code' set to t (e.g. iso-latin-1) on reading a file
7710 or reading output of a subprocess.
7711 Only the 128th through 159th elements have a meaning. */);
7712 Vlatin_extra_code_table = Fmake_vector (make_number (256), Qnil);
7714 DEFVAR_LISP ("select-safe-coding-system-function",
7715 &Vselect_safe_coding_system_function,
7716 doc: /* Function to call to select a safe coding system for encoding text.
7718 If set, this function is called to force a user to select a proper
7719 coding system which can encode the text in the case that a default
7720 coding system used in each operation can't encode the text.
7722 The default value is `select-safe-coding-system' (which see). */);
7723 Vselect_safe_coding_system_function = Qnil;
7725 DEFVAR_BOOL ("coding-system-require-warning",
7726 &coding_system_require_warning,
7727 doc: /* Internal use only.
7728 If non-nil, on writing a file, `select-safe-coding-system-function' is
7729 called even if `coding-system-for-write' is non-nil. The command
7730 `universal-coding-system-argument' binds this variable to t temporarily. */);
7731 coding_system_require_warning = 0;
7734 DEFVAR_LISP ("char-coding-system-table", &Vchar_coding_system_table,
7735 doc: /* Char-table containing the safe coding systems of each character.
7736 No element includes the generic coding systems that can
7737 encode any character. Those are in the first extra slot. */);
7738 Vchar_coding_system_table = Fmake_char_table (Qchar_coding_system, Qnil);
7740 DEFVAR_BOOL ("inhibit-iso-escape-detection",
7741 &inhibit_iso_escape_detection,
7742 doc: /* If non-nil, Emacs ignores ISO2022 escape sequences during code detection.
7744 By default, on reading a file, Emacs tries to detect how the text is
7745 encoded. This code detection is sensitive to escape sequences. If
7746 the sequence is valid as ISO2022, the code is determined as one of
7747 the ISO2022 encodings, and the file is decoded by the corresponding
7748 coding system (e.g. `iso-2022-7bit').
7750 However, there may be cases where you want to read escape sequences in
7751 a file as is. In such a case, you can set this variable to non-nil.
7752 Then, as the code detection ignores any escape sequences, no file is
7753 detected as encoded in some ISO2022 encoding. The result is that all
7754 escape sequences become visible in a buffer.
7756 The default value is nil, and it is strongly recommended not to change
7757 it. That is because many Emacs Lisp source files that contain
7758 non-ASCII characters are encoded by the coding system `iso-2022-7bit'
7759 in Emacs's distribution, and they won't be decoded correctly on
7760 reading if you suppress escape sequence detection.
7762 The other way to read escape sequences in a file without decoding is
7763 to explicitly specify some coding system that doesn't use ISO2022's
7764 escape sequences (e.g. `latin-1') on reading by \\[universal-coding-system-argument]. */);
7765 inhibit_iso_escape_detection = 0;
7767 DEFVAR_LISP ("translation-table-for-input", &Vtranslation_table_for_input,
7768 doc: /* Char table for translating self-inserting characters.
7769 This is applied to the result of input methods, not their input. See also
7770 `keyboard-translate-table'. */);
7771 Vtranslation_table_for_input = Qnil;
7774 char *
7775 emacs_strerror (error_number)
7776 int error_number;
7778 char *str;
7780 synchronize_system_messages_locale ();
7781 str = strerror (error_number);
7783 if (! NILP (Vlocale_coding_system))
7785 Lisp_Object dec = code_convert_string_norecord (build_string (str),
7786 Vlocale_coding_system,
7787 0);
7788 str = (char *) SDATA (dec);
7791 return str;
7794 #endif /* emacs */