Commit graph

32 commits

Author SHA1 Message Date
Lysandros Nikolaou 4b5d3e0e72
gh-120343: Fix column offsets of multiline tokens in tokenize (#120391) 2024-06-12 20:52:55 +02:00
Lysandros Nikolaou 1b62bcee94
gh-120343: Do not reset byte_col_offset_diff after multiline tokens (#120352)
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
2024-06-11 17:00:53 +00:00
Kirill Podoprigora c0faade891
gh-119704: Fix reference leak in the `Python/Python-tokenize.c` (#119705) 2024-05-29 07:56:44 +01:00
Lysandros Nikolaou d87b015106
gh-119118: Fix performance regression in tokenize module (#119615)
* gh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
2024-05-28 19:17:49 +00:00
Brett Simmers c2627d6eea
gh-116322: Add Py_mod_gil module slot (#116882)
This PR adds the ability to enable the GIL if it was disabled at
interpreter startup, and modifies the multi-phase module initialization
path to enable the GIL when loading a module, unless that module's spec
includes a slot indicating it can run safely without the GIL.

PEP 703 called the constant for the slot `Py_mod_gil_not_used`; I went
with `Py_MOD_GIL_NOT_USED` for consistency with gh-104148.

A warning will be issued up to once per interpreter for the first
GIL-using module that is loaded. If `-v` is given, a shorter message
will be printed to stderr every time a GIL-using module is loaded
(including the first one that issues a warning).
2024-05-03 11:30:55 -04:00
Pablo Galindo Salgado a135a6d2c6
gh-112943: Correctly compute end offsets for multiline tokens in the tokenize module (#112949) 2023-12-11 11:44:22 +00:00
Serhiy Storchaka a12f624a9d
Remove unnecessary includes (GH-111633) 2023-11-02 10:42:58 +02:00
Lysandros Nikolaou 01481f2dc1
gh-104169: Refactor tokenizer into lexer and wrappers (#110684)
* The lexer, which include the actual lexeme producing logic, goes into
  the `lexer` directory.
* The wrappers, one wrapper per input mode (file, string, utf-8, and
  readline), go into the `tokenizer` directory and include logic for
  creating a lexer instance and managing the buffer for different modes.
---------

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
2023-10-11 15:14:44 +00:00
Pablo Galindo Salgado da8f87b7ea
gh-107015: Remove async_hacks from the tokenizer (#107018) 2023-07-26 16:34:15 +01:00
Pablo Galindo Salgado d7f46bcd98
gh-105564: Don't include artificial newlines in the line attribute of tokens (#105565) 2023-06-09 17:01:26 +01:00
Kirill Podoprigora 264a0110ff
gh-105390: Add explicit type cast (#105466) 2023-06-07 20:20:43 +00:00
Pablo Galindo Salgado 7279fb6408
gh-105435: Fix spurious NEWLINE token if file ends with comment without a newline (#105442) 2023-06-07 13:31:48 +01:00
Pablo Galindo Salgado ffd2654550
gh-105390: Correctly raise TokenError instead of SyntaxError for tokenize errors (#105399) 2023-06-07 12:04:40 +01:00
Pablo Galindo Salgado c0a6ed3934
gh-105259: Ensure we don't show newline characters for trailing NEWLINE tokens (#105364) 2023-06-06 12:52:16 +01:00
Lysandros Nikolaou 70f315c2d6
gh-105042: Disable unmatched parens syntax error in python tokenize (#105061) 2023-05-30 22:52:52 +01:00
Pablo Galindo Salgado 9216e69a87
gh-105069: Add a readline-like callable to the tokenizer to consume input iteratively (#105070) 2023-05-30 22:43:34 +01:00
Marta Gómez Macías 96fff35325
gh-105017: Include CRLF lines in strings and column numbers (#105030)
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
2023-05-28 15:15:53 +01:00
Pablo Galindo Salgado 46b52e6e2b
gh-104976: Ensure trailing dedent tokens are emitted as the previous tokenizer (#104980)
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
2023-05-26 22:02:26 +01:00
Pablo Galindo Salgado 3fdb55c482
gh-104972: Ensure that line attributes in tokens in the tokenize module are correct (#104975) 2023-05-26 15:46:22 +01:00
Pablo Galindo Salgado c8cf9b42eb
gh-104825: Remove implicit newline in the line attribute in tokens emitted in the tokenize module (#104846) 2023-05-24 09:59:18 +00:00
Marta Gómez Macías 729b252241
gh-104741: Add line number attribute to indentation error exception (#104743) 2023-05-22 11:30:18 +00:00
Marta Gómez Macías 8817886ae5
gh-102856: Tokenize performance improvement (#104731) 2023-05-22 00:29:04 +00:00
Marta Gómez Macías 6715f91edc
gh-102856: Python tokenizer implementation for PEP 701 (#104323)
This commit replaces the Python implementation of the tokenize module with an implementation
that reuses the real C tokenizer via a private extension module. The tokenize module now implements
a compatibility layer that transforms tokens from the C tokenizer into Python tokenize tokens for backward
compatibility.

As the C tokenizer does not emit some tokens that the Python tokenizer provides (such as comments and non-semantic newlines), a new special mode has been added to the C tokenizer mode that currently is only used via
the extension module that exposes it to the Python layer. This new mode forces the C tokenizer to emit these new extra tokens and add the appropriate metadata that is needed to match the old Python implementation.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
2023-05-21 01:03:02 +01:00
Eric Snow a9c6e0618f
gh-99113: Add Py_MOD_PER_INTERPRETER_GIL_SUPPORTED (gh-104205)
Here we are doing no more than adding the value for Py_mod_multiple_interpreters and using it for stdlib modules.  We will start checking for it in gh-104206 (once PyInterpreterState.ceval.own_gil is added in gh-104204).
2023-05-05 21:11:27 +00:00
Pablo Galindo Salgado 1ef61cf71a
gh-102856: Initial implementation of PEP 701 (#102855)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: sunmy2019 <59365878+sunmy2019@users.noreply.github.com>
2023-04-19 11:18:16 -05:00
Lysandros Nikolaou cbf0afd8a1
gh-97973: Return all necessary information from the tokenizer (GH-97984)
Right now, the tokenizer only returns type and two pointers to the start and end of the token.
This PR modifies the tokenizer to return the type and set all of the necessary information,
so that the parser does not have to this.
2022-10-06 16:07:17 -07:00
Eric Snow 6f6a4e6cc5
gh-90928: Statically Initialize the Keywords Tuple in Clinic-Generated Code (gh-95860)
We only statically initialize for core code and builtin modules.  Extension modules still create
the tuple at runtime.  We'll solve that part of interpreter isolation separately.

This change includes generated code. The non-generated changes are in:

* Tools/clinic/clinic.py
* Python/getargs.c
* Include/cpython/modsupport.h
* Makefile.pre.in (re-generate global strings after running clinic)
* very minor tweaks to Modules/_codecsmodule.c and Python/Python-tokenize.c

All other changes are generated code (clinic, global strings).
2022-08-11 15:25:49 -06:00
Petr Viktorin 204946986f
bpo-46613: Add PyType_GetModuleByDef to the public API (GH-31081)
* Make PyType_GetModuleByDef public (remove underscore)

Co-authored-by: Victor Stinner <vstinner@python.org>
2022-02-11 17:22:11 +01:00
Victor Stinner 713bb19356
bpo-45434: Mark the PyTokenizer C API as private (GH-28924)
Rename PyTokenize functions to mark them as private:

* PyTokenizer_FindEncodingFilename() => _PyTokenizer_FindEncodingFilename()
* PyTokenizer_FromString() => _PyTokenizer_FromString()
* PyTokenizer_FromFile() => _PyTokenizer_FromFile()
* PyTokenizer_FromUTF8() => _PyTokenizer_FromUTF8()
* PyTokenizer_Free() => _PyTokenizer_Free()
* PyTokenizer_Get() => _PyTokenizer_Get()

Remove the unused PyTokenizer_FindEncoding() function.

import.c: remove unused #include "errcode.h".
2021-10-13 17:22:14 +02:00
Serhiy Storchaka a5a56154f1
Remove trailing spaces. (GH-28706) 2021-10-03 16:58:14 +03:00
Pablo Galindo Salgado 214c2e5d91
Format the Python-tokenize module and fix exit path (GH-27935) 2021-08-25 14:41:14 +02:00
Pablo Galindo Salgado a24676bedc
Add tests for the C tokenizer and expose it as a private module (GH-27924) 2021-08-24 17:50:05 +01:00