Previously we stepped through the document character by character which involved a lot of extra processing inside OrgSource. By scanning for possible keywords, we can skip many of the intermediate steps.
These functions aren't exposed to the public so we can confidently say that if they work in dev then they will work in production. Removing these asserts theoretically should result in a speedup.
Currently this is a copy of the Source trait but it will grow to more functions. The Source trait is restricted to this crate in anticipation of its removal in favor of StandardProperties.
It is possible to have two headlines that have the same level but different star counts when set to Odd because of rounding. Deciding nesting by star count instead of headline level avoids this issue.
This fixes an issue where lines in a paragraph were incorrectly getting identified as lists because I had defaulted to assuming alphabetical bullets were allowed.
Emacs will auto-align tables when org-mode is loaded if the document contains "#+STARTUP: align". Since Organic is just a parser, it has no business editing the input it receives so we are disabling this auto-align in Emacs to make the tests work properly.
This bug probably exists in hundreds of places across the code base. I am going to have to write a "fuzzer" that replaces random whitespace with tabs to find them all.
This test will grab documents from external sources and compare Organic's parser vs the official org-mode parser to ensure they are parsing the same. This is so we do not introduce large irrelevant documents in the git history and so we do not introduce documents with restrictive licenses into the repository.
This change should hopefully allow for matchers to have captured borrowed values, it should eliminate the use of heap-allocated reference counting on the context nodes, and it adds in a global settings struct for passing around values that do not change during parsing.
We were running into issues where the documents grew too large for being passed as a string to emacs, and we need to handle #+setupfile so we need to start handling org-mode documents as files and not just as anonymous streams of text.
The anonymous stream of text handling will remain because the automated tests use it.
This shows that greater blocks of different types can be directly nested even if they are both special blocks (as long as they are different special blocks).
We were re-parsing each footnote definition in the footnote definition exit matcher which causes their contents to get re-parsed. This compounds with long lists of footnote definitions.
It is currently unknown if this will produce a performance increase, but unless it has a significant performance penalty we are going to go forward with this change because it makes it more explicit which values need to be read deeply from other elements (therefore needing to be in the context) vs values that can be bound to the exit matcher since they are only used within the confines of the current element.
I suspect we will get a performance boost since it will be reducing the nodes that need to be walked in the context but maintaining bracket depth count over the entire document instead of only inside elements that need balanced brackets could cost us.
The documentation currently states that the body for these cannot span more than three lines but that is not the behavior I am seeing from emacs in practice. Waiting on a mailing list response to tell me if this is a documentation error or a parser error.
This was previously using the standard docker makefile I use as a starting point for all of my docker makefiles. Now it will properly mount the source directory.
This is not meant to produce publishable or comparable benchmarks. Such a script would have to run many iterations with the input already loaded into memory, proper prioritization via nice/ionice, and have a warm-up phase. This is just automating a basic test I am frequently running to compare parse times when investigating performance issues.
This is an optimization. When you have something like plain text which ends when it hits the next element, we only need to parse enough to detect that an element is about to occur. For elements like plain lists, this is as simple as parsing a line starting with optional whitespace and then a bullet, which avoids parsing the entire plain list tree. The benefit is most noticeable in deeply nested plain lists.
Paragraph's exit matcher which detects elements was causing the plain list parser to exit after the first item was parsed which was causing significant amounts of re-parsing.
This can be done a lot more efficiently now that we are keeping track of this information in the wrapped input type instead of having to fetch to the original document out of the context tree.
I used the wrong word. This is referring to the priority between paragraphs ending vs export snippets ending, not a reference to something occurring in the past.
This is to be a good citizen by not downloading all the rust dependencies every time I run the tests locally. Unfortunately, it will still compile all the dependencies each time, but that is a local operation.
The behavior in emacs does not match the description in the org-mode documentation. I have sent an email to the org-mode mailing list and I am waiting their response so I can adjust (or not adjust) my parser accordingly.
We are including a bunch of files that are not needed for running the rust code. This excludes them to be a better citizen to both crates.io and all users of this package.
2023-08-11 00:11:54 -04:00
249 changed files with 10244 additions and 4396 deletions
# 4317 for OTLP gRPC, 4318 for OTLP HTTP. We currently use gRPC but I forward both ports regardless.
#
# These flags didn't help even though they seem like they would: --collector.otlp.grpc.max-message-size=10000000 --collector.queue-size=20000 --collector.num-workers=100
Organic is an emacs-less implementation of an [org-mode](https://orgmode.org/) parser.
## Project Status
This project is a personal learning project to grow my experience in [rust](https://www.rust-lang.org/). It is under development and at this time I would not recommend anyone use this code. The goal is to turn this into a project others can use, at which point more information will appear in this README.
This project is still under HEAVY development. While the version remains v0.1.x the API will be changing often. Once we hit v0.2.x we will start following semver.
Currently, the parser is able to correctly identify the start/end bounds of all the org-mode objects and elements (except table.el tables, org-mode tables are supported) but many of the interior properties are not yet populated.
### Project Goals
- We aim to provide perfect parity with the emacs org-mode parser. In that regard, any document that parses differently between Emacs and Organic is considered a bug.
- The parser should be fast. We're not doing anything special, but since this is written in Rust and natively compiled we should be able to beat the existing parsers.
- The parser should have minimal dependencies. This should reduce effort w.r.t.: security audits, legal compliance, portability.
- The parser should be usable everywhere. In the interest of getting org-mode used in as many places as possible, this parser should be usable by everyone everywhere. This means:
- It must have a permissive license for use in proprietary code bases.
- We will investigate compiling to WASM. This is an important goal of the project and will definitely happen, but only after the parser has a more stable API.
- We will investigate compiling to a C library for native linking to other code. This is more of a maybe-goal for the project.
### Project Non-Goals
- This project will not include an elisp engine since that would drastically increase the complexity of the code. Any features requiring an elisp engine will not be implemented (for example, Emacs supports embedded eval expressions in documents but this parser will never support that).
- This project is exclusively an org-mode **parser**. This limits its scope to roughly the output of `(org-element-parse-buffer)`. It will not render org-mode documents in other formats like HTML or LaTeX.
### Project Maybe-Goals
- table.el support. Currently we support org-mode tables but org-mode also allows table.el tables. So far, their use in org-mode documents seems rather uncommon so this is a low-priority feature.
- Document editing support. I do not anticipate any advanced editing features to make editing ergonomic, but it should be relatively easy to be able to parse an org-mode document and serialize it back into org-mode. This would enable cool features to be built on top of the library like auto-formatters. To accomplish this feature, We'd have to capture all of the various separators and whitespace that we are currently simply throwing away. This would add many additional fields to the parsed structs and it would add more noise to the parsers themselves, so I do not want to approach this feature until the parser is more complete since it would make modifications and refactoring more difficult.
### Supported Versions
This project targets the version of Emacs and Org-mode that are built into the [organic-test docker image](docker/organic_test/Dockerfile). This is newer than the version of Org-mode that shipped with Emacs 29.1. The parser itself does not depend on Emacs or Org-mode though, so this only matters for development purposes when running the automated tests that compare against upstream Org-mode.
## Using this library
TODO: Add section on using Organic as a library (which is the intended use for this project). This will be added when we have a bit more API stability since currently the library is under heavy development.
## Development
### The parse binary
This program takes org-mode input either streamed in on stdin or as paths to files passed in as arguments. It then parses them using Organic and dumps the result to stdout. This program is intended solely as a development tool. Examples:
This program takes org-mode input either streamed in on stdin or as paths to files passed in as arguments. It then parses them using Organic and the official Emacs Org-mode parser and compares the parse result. This program is intended solely as a development tool. Since org-mode is a moving target, it is recommended that you run this through docker since we pin the version of org-mode to a specific revision. Examples:
There are three levels of tests for this repository: the standard tests, the autogenerated tests, and the foreign document tests.
### The standard tests
These are regular hand-written rust tests. These can be run with:
```bash
make unittest
```
### The auto-generated tests
These tests are automatically generated from the files in the `org_mode_samples` directory and they are still integrated with the rust/cargo testing framework. For each org-mode document in that folder, a test is generated that will parse the document with both Organic and the official Emacs Org-mode parser and then it will compare the parse results. Any deviation is considered a failure. Since org-mode is a moving target, it is recommended that you run these tests inside docker since the `organic-test` docker image is pinned to a specific revision of org-mode. These can be run with:
```bash
make dockertest
```
### The foreign document tests
These tests function the same as the auto-generated tests except they are **not** integrated with the rust/cargo testing framework and they involve comparing the parse of org-mode documents that live outside this repository. This allows us to test against a far greater variety of org-mode input documents without pulling massive sets of org-mode documents into this repository. The recommended way to run these tests is still through docker because it pins org-mode and the test documents to specific git revisions. These can be run with:
```bash
make foreign_document_test
```
## License
This project is released under the public-domain-equivalent [0BSD license](https://www.tldrlegal.com/license/bsd-0-clause-license). This license puts no restrictions on the use of this code (you do not even have to include the copyright notice or license text when using it). HOWEVER, this project has a couple permissively licensed dependencies which do require their copyright notices and/or license texts to be included. I am not a lawyer and this is not legal advice but it is my layperson's understanding that if you distribute a binary with this library linked in, you will need to abide by their terms since their code will also be linked in your binary. I try to keep the dependencies to a minimum and the most restrictive dependency I will ever include is a permissively licensed one.
This project is released under the public-domain-equivalent [0BSD license](https://www.tldrlegal.com/license/bsd-0-clause-license), however, this project has a couple permissively licensed non-public-domain-equivalent dependencies which require their copyright notices and/or license texts to be included. I am not a lawyer and this is not legal advice but it is my layperson's understanding that if you distribute a binary statically linking this library, you will need to abide by their terms since their code will also be linked in your binary.
"drawer_drawer_with_headline_inside"=>Some("Apparently lines with :end: become their own paragraph. This odd behavior needs to be investigated more."),
"element_container_priority_footnote_definition_dynamic_block"=>Some("Apparently broken begin lines become their own paragraph."),
"paragraphs_paragraph_with_backslash_line_breaks"=>Some("The text we're getting out of the parse tree is already processed to remove line breaks, so our comparison needs to take that into account."),
"export_snippet_paragraph_break_precedent"=>Some("Emacs 28 has broken behavior so the tests in the CI fail."),
"autogen_greater_element_drawer_drawer_with_headline_inside"=>Some("Apparently lines with :end: become their own paragraph. This odd behavior needs to be investigated more."),
"autogen_element_container_priority_footnote_definition_dynamic_block"=>Some("Apparently broken begin lines become their own paragraph."),
# Savannah does not allow fetching specific revisions, so we're going to have to put unnecessary load on their server by cloning main and then checking out the revision we want.
It might help analysis to record how often we start a specific type of parse for each character. For example, at the start of a plain list, if we had a count of how often each character was the start of a parse of a list we could use that to see how often that list is getting re-parsed.
* Optimizations
** Edit whitespace for list items
Whether or not a list item owns the trailing whitespace depends on if it is the last list item in that list. Since we do not know ahead of time if an item is the last item in the list, we have to either re-parse the list item or modify it after parsing.
*** For
We already are modifying the source of some elements after-the-fact with src_rust{set_source()} so this would be more of the same.
*** Against
I'd like to phase out such modifications because they seem hacky and fragile.
** Make detect element function
Some exit matchers are based on when the next element is found. Some elements do not need to be fully parsed to identify that they are a valid element. For example, src_org{1. foo} can already be identified as the start of a plain list (in the right context) without needing to parse the entire element.
*** For
Avoiding parsing the entire element for an exit matcher would reduce redundant parses.
*** Against
This adds code complexity and introduces the potential for bugs.
How many elements can be reasonably early-detected? For example, src_org{#+begin_src foo} is not enough to detect the start of a source block because without the src_org{#+end_src} it is just plain text.
** Grab multiple characters in plaintext parser before checking exit matcher
Currently we check the exit matcher after each character inside the plain text parser (and many others). Are there character sequences we can assume no exit matcher will trigger between? For example, a contiguous string of latin-alphabet letters?
*** For
This could significantly reduce our calls to exit matchers.
*** Against
I think targets would break this.
The exit matchers are already implicitly building this behavior since they should all exit very early when the starting character is wrong. Putting this logic in a centralized place, far away from where those characters are actually going to be used, is unfortunate for readability.
** Use exit matcher to cut off trailing whitespace instead of re-matching in plain lists.
# These are only allowed by configuring org-list-allow-alphabetical which the automated tests are not currently set up to do, so this will parse as a paragraph:
# The STARTUP directive here instructs org-mode to align tables which emacs normally does when opening the file. Since Organic is solely a parser, we have no business editing the org-mode document so Organic does not handle aligning tables, so in order for this test to pass, we have to avoid that behavior in Emacs.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.