The getters were a good idea, but if we are going to support editing later, we will need to expose the fields or write A LOT of boiler-plate. The getters also would prevent people from moving values out of the AST without even more boiler-plate. It is simply not worth it at this stage, so we will need to tolerate frequently changing semver versions as the public interface changes since *every* field in the AST is public.
This was written because I originally intended to make the fields of the ast node types entirely private, but that made constructing them tedious so they are pub(crate) which coincidentally also allows them to be used by the iterator.
Ultimately this is about semver and exposing a stable interface while allowing the internal representation to change. The fields are still pub(crate) to make constructing the types easier inside this crate, which should be fine because we can refactor the code inside this crate whenever the internal structure changes.
This implementation will reduce the use of heap by elimininating Box<> from the individual iterators but it will still need heap for maintaining a vector of iterators from nodes.
Previously we stepped through the document character by character which involved a lot of extra processing inside OrgSource. By scanning for possible keywords, we can skip many of the intermediate steps.
These functions aren't exposed to the public so we can confidently say that if they work in dev then they will work in production. Removing these asserts theoretically should result in a speedup.
Currently this is a copy of the Source trait but it will grow to more functions. The Source trait is restricted to this crate in anticipation of its removal in favor of StandardProperties.
It is possible to have two headlines that have the same level but different star counts when set to Odd because of rounding. Deciding nesting by star count instead of headline level avoids this issue.
This fixes an issue where lines in a paragraph were incorrectly getting identified as lists because I had defaulted to assuming alphabetical bullets were allowed.
Emacs will auto-align tables when org-mode is loaded if the document contains "#+STARTUP: align". Since Organic is just a parser, it has no business editing the input it receives so we are disabling this auto-align in Emacs to make the tests work properly.
This bug probably exists in hundreds of places across the code base. I am going to have to write a "fuzzer" that replaces random whitespace with tabs to find them all.
This test will grab documents from external sources and compare Organic's parser vs the official org-mode parser to ensure they are parsing the same. This is so we do not introduce large irrelevant documents in the git history and so we do not introduce documents with restrictive licenses into the repository.
This change should hopefully allow for matchers to have captured borrowed values, it should eliminate the use of heap-allocated reference counting on the context nodes, and it adds in a global settings struct for passing around values that do not change during parsing.
We were running into issues where the documents grew too large for being passed as a string to emacs, and we need to handle #+setupfile so we need to start handling org-mode documents as files and not just as anonymous streams of text.
The anonymous stream of text handling will remain because the automated tests use it.
This shows that greater blocks of different types can be directly nested even if they are both special blocks (as long as they are different special blocks).
We were re-parsing each footnote definition in the footnote definition exit matcher which causes their contents to get re-parsed. This compounds with long lists of footnote definitions.
It is currently unknown if this will produce a performance increase, but unless it has a significant performance penalty we are going to go forward with this change because it makes it more explicit which values need to be read deeply from other elements (therefore needing to be in the context) vs values that can be bound to the exit matcher since they are only used within the confines of the current element.
I suspect we will get a performance boost since it will be reducing the nodes that need to be walked in the context but maintaining bracket depth count over the entire document instead of only inside elements that need balanced brackets could cost us.
The documentation currently states that the body for these cannot span more than three lines but that is not the behavior I am seeing from emacs in practice. Waiting on a mailing list response to tell me if this is a documentation error or a parser error.
This was previously using the standard docker makefile I use as a starting point for all of my docker makefiles. Now it will properly mount the source directory.
This is not meant to produce publishable or comparable benchmarks. Such a script would have to run many iterations with the input already loaded into memory, proper prioritization via nice/ionice, and have a warm-up phase. This is just automating a basic test I am frequently running to compare parse times when investigating performance issues.
Organic is an emacs-less implementation of an [org-mode](https://orgmode.org/) parser.
Organic is an emacs-less implementation of an [org-mode](https://orgmode.org/) parser.
## Project Status
## Project Status
This project is a personal learning project to grow my experience in [rust](https://www.rust-lang.org/). It is under development and at this time I would not recommend anyone use this code. The goal is to turn this into a project others can use, at which point more information will appear in this README.
This project is still under HEAVY development. While the version remains v0.1.x the API will be changing often. Once we hit v0.2.x we will start following semver.
Currently, the parser is able to correctly identify the start/end bounds of all the org-mode objects and elements (except table.el tables, org-mode tables are supported) but many of the interior properties are not yet populated.
### Project Goals
- We aim to provide perfect parity with the emacs org-mode parser. In that regard, any document that parses differently between Emacs and Organic is considered a bug.
- The parser should be fast. We're not doing anything special, but since this is written in Rust and natively compiled we should be able to beat the existing parsers.
- The parser should have minimal dependencies. This should reduce effort w.r.t.: security audits, legal compliance, portability.
- The parser should be usable everywhere. In the interest of getting org-mode used in as many places as possible, this parser should be usable by everyone everywhere. This means:
- It must have a permissive license for use in proprietary code bases.
- We will investigate compiling to WASM. This is an important goal of the project and will definitely happen, but only after the parser has a more stable API.
- We will investigate compiling to a C library for native linking to other code. This is more of a maybe-goal for the project.
### Project Non-Goals
- This project will not include an elisp engine since that would drastically increase the complexity of the code. Any features requiring an elisp engine will not be implemented (for example, Emacs supports embedded eval expressions in documents but this parser will never support that).
- This project is exclusively an org-mode **parser**. This limits its scope to roughly the output of `(org-element-parse-buffer)`. It will not render org-mode documents in other formats like HTML or LaTeX.
### Project Maybe-Goals
- table.el support. Currently we support org-mode tables but org-mode also allows table.el tables. So far, their use in org-mode documents seems rather uncommon so this is a low-priority feature.
- Document editing support. I do not anticipate any advanced editing features to make editing ergonomic, but it should be relatively easy to be able to parse an org-mode document and serialize it back into org-mode. This would enable cool features to be built on top of the library like auto-formatters. To accomplish this feature, We'd have to capture all of the various separators and whitespace that we are currently simply throwing away. This would add many additional fields to the parsed structs and it would add more noise to the parsers themselves, so I do not want to approach this feature until the parser is more complete since it would make modifications and refactoring more difficult.
### Supported Versions
This project targets the version of Emacs and Org-mode that are built into the [organic-test docker image](docker/organic_test/Dockerfile). This is newer than the version of Org-mode that shipped with Emacs 29.1. The parser itself does not depend on Emacs or Org-mode though, so this only matters for development purposes when running the automated tests that compare against upstream Org-mode.
## Using this library
TODO: Add section on using Organic as a library (which is the intended use for this project). This will be added when we have a bit more API stability since currently the library is under heavy development.
## Development
### The parse binary
This program takes org-mode input either streamed in on stdin or as paths to files passed in as arguments. It then parses them using Organic and dumps the result to stdout. This program is intended solely as a development tool. Examples:
This program takes org-mode input either streamed in on stdin or as paths to files passed in as arguments. It then parses them using Organic and the official Emacs Org-mode parser and compares the parse result. This program is intended solely as a development tool. Since org-mode is a moving target, it is recommended that you run this through docker since we pin the version of org-mode to a specific revision. Examples:
There are three levels of tests for this repository: the standard tests, the autogenerated tests, and the foreign document tests.
### The standard tests
These are regular hand-written rust tests. These can be run with:
```bash
make unittest
```
### The auto-generated tests
These tests are automatically generated from the files in the `org_mode_samples` directory and they are still integrated with the rust/cargo testing framework. For each org-mode document in that folder, a test is generated that will parse the document with both Organic and the official Emacs Org-mode parser and then it will compare the parse results. Any deviation is considered a failure. Since org-mode is a moving target, it is recommended that you run these tests inside docker since the `organic-test` docker image is pinned to a specific revision of org-mode. These can be run with:
```bash
make dockertest
```
### The foreign document tests
These tests function the same as the auto-generated tests except they are **not** integrated with the rust/cargo testing framework and they involve comparing the parse of org-mode documents that live outside this repository. This allows us to test against a far greater variety of org-mode input documents without pulling massive sets of org-mode documents into this repository. The recommended way to run these tests is still through docker because it pins org-mode and the test documents to specific git revisions. These can be run with:
```bash
make foreign_document_test
```
## License
## License
This project is released under the public-domain-equivalent [0BSD license](https://www.tldrlegal.com/license/bsd-0-clause-license). This license puts no restrictions on the use of this code (you do not even have to include the copyright notice or license text when using it). HOWEVER, this project has a couple permissively licensed dependencies which do require their copyright notices and/or license texts to be included. I am not a lawyer and this is not legal advice but it is my layperson's understanding that if you distribute a binary with this library linked in, you will need to abide by their terms since their code will also be linked in your binary. I try to keep the dependencies to a minimum and the most restrictive dependency I will ever include is a permissively licensed one.
This project is released under the public-domain-equivalent [0BSD license](https://www.tldrlegal.com/license/bsd-0-clause-license), however, this project has a couple permissively licensed non-public-domain-equivalent dependencies which require their copyright notices and/or license texts to be included. I am not a lawyer and this is not legal advice but it is my layperson's understanding that if you distribute a binary statically linking this library, you will need to abide by their terms since their code will also be linked in your binary.
"autogen_greater_element_drawer_drawer_with_headline_inside"=>Some("Apparently lines with :end: become their own paragraph. This odd behavior needs to be investigated more."),
"greater_element_drawer_drawer_with_headline_inside"=>Some("Apparently lines with :end: become their own paragraph. This odd behavior needs to be investigated more."),
"autogen_element_container_priority_footnote_definition_dynamic_block"=>Some("Apparently broken begin lines become their own paragraph."),
"element_container_priority_footnote_definition_dynamic_block"=>Some("Apparently broken begin lines become their own paragraph."),
"autogen_lesser_element_paragraphs_paragraph_with_backslash_line_breaks"=>Some("The text we're getting out of the parse tree is already processed to remove line breaks, so our comparison needs to take that into account."),
# Savannah does not allow fetching specific revisions, so we're going to have to put unnecessary load on their server by cloning main and then checking out the revision we want.
# Savannah does not allow fetching specific revisions, so we're going to have to put unnecessary load on their server by cloning main and then checking out the revision we want.
@@ -25,8 +25,83 @@ RUN make compile
RUN make DESTDIR="/root/dist" install
RUN make DESTDIR="/root/dist" install
FROMrustlang/rust:nightly-alpine3.17
FROMrustlang/rust:nightly-alpine3.17AStester
RUN apk add --no-cache musl-dev ncurses gnutls
ENVLANG=en_US.UTF-8
RUN apk add --no-cache musl-dev ncurses gnutls libgccjit
RUN cargo install --locked --no-default-features --features ci-autoclean cargo-cache
RUN cargo install --locked --no-default-features --features ci-autoclean cargo-cache
@@ -25,3 +25,4 @@ This could significantly reduce our calls to exit matchers.
I think targets would break this.
I think targets would break this.
The exit matchers are already implicitly building this behavior since they should all exit very early when the starting character is wrong. Putting this logic in a centralized place, far away from where those characters are actually going to be used, is unfortunate for readability.
The exit matchers are already implicitly building this behavior since they should all exit very early when the starting character is wrong. Putting this logic in a centralized place, far away from where those characters are actually going to be used, is unfortunate for readability.
** Use exit matcher to cut off trailing whitespace instead of re-matching in plain lists.
The autogen tests are the tests automatically generated to compare the output of Organic vs the upstream Emacs Org-mode parser using the sample documents in the =org_mode_samples= folder. They will have a prefix based on the settings for each test.
- default :: The test is run with the default settings (The upstream Emacs Org-mode determines the default settings)
- la :: Short for "list alphabetic". Enables alphabetic plain lists.
- t# :: Sets the tab-width to # (as in t4 sets the tab-width to 4).
- odd :: Sets the org-odd-levels-only setting to true (meaning "odd" as opposed to "oddeven").
# These are only allowed by configuring org-list-allow-alphabetical which the automated tests are not currently set up to do, so this will parse as a paragraph:
# "lorem" is prefixed by a tab instead of spaces, so the editor's tab-width value determines whether lorem is a sibling of baz (tab-width 8), a sibling of bar (tab-width < 8), or a child of baz (tab-width > 8).
# The STARTUP directive here instructs org-mode to align tables which emacs normally does when opening the file. Since Organic is solely a parser, we have no business editing the org-mode document so Organic does not handle aligning tables, so in order for this test to pass, we have to avoid that behavior in Emacs.
# Even though *exporting* honors the setting to require braces for subscript/superscript, the official org-mode parser still parses subscripts and superscripts.
#+OPTIONS: ^:{}
foo_this isn't a subscript when exported due to lack of braces (but its still a subscript during parsing)
# @ : Log changes leading to this state and prompt for a comment to include.
# /! : Log changes leaving this state if and only if to a state that does not log. This can be combined with the above like WAIT(w!/!) or DELAYED(d@/!)
* INPROGRESS
- State "TODO" from "INPROGRESS" [2023-09-14 Thu 02:13]
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.